Monday 30 September 2013

Maturing your Capacity Management processes (1/11)


Mature capacity management, whilst difficult to achieve, can provide enormous value to an organisation.
Regardless of the technology, the methods and techniques employed in capacity management are essential to optimizing IT expenditure, reducing the risk associated with IT change and ultimately ensuring that IT resources meet the needs of the business.

One of the biggest challenges in implementing a new capacity management process or maturing an existing process is aligning with the business. 
This alignment ultimately provides the key information required to understand “the capacity” of an enterprise and plan accordingly. 
This is an essential step in implementing and providing mature capacity management, but the majority of organizations have yet to achieve it and are still very much focused on the component level of capacity management. 
This is true across all market sectors with these organizations exhibiting the following common traits:

·         The business views IT capacity as infinite

·         Lots of data available, but not used to its full potential

·         Capacity management has a reactive focus

·         Planning purely based on historical data

·         Any capacity management is happening in technical silos

In this blog series I'll aim to address some of the challenges faced when implementing business-level capacity management.
Don't forget to sign up for my forthcoming webinars http://www.metron-athene.com/services/training/webinars/index.html

Rob Ford
Principal Consultant

Wednesday 25 September 2013

Capacity Management for Big Data - Managing and Monitoring a Hadoop Cluster (3/3)

Now that we have a better idea of what Hadoop is and how it’s used to manage Big Data, the most important thing to put in place is a mechanism to monitor the performance of the cluster and the nodes within the cluster.

Hadoop is supported on Linux and Windows (and it can also run on other operating systems, such as BSD, Mac OS X, and OpenSolaris).  Utilities and performance counters already exist for those operating systems, which means that athene® can be implemented to capture performance and capacity data from those nodes, and that data can be stored in the organization's Capacity Management Information Store (CMIS).
The Capacity Manager, as in all other environments, is interested in a number of things:

(1)  How are the cluster and the nodes within it performing now?  Performance data showing how much CPU, memory, network resource, and disk I/O is being used is readily available and can be stored and reported upon within athene®.  Web reporting and web dashboards can easily show the Capacity Manager the health of the cluster nodes, and automatic alerting can quickly point the Capacity Manager to exceptions that can be the source of performance problems in the environment.
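
To make that concrete, below is a minimal Python sketch of the kind of per-node sampling such a collector performs.  It is not athene® itself: it assumes the third-party psutil package is installed, and the 60-second interval and console output are purely illustrative (a real collector would feed the CMIS).

    # Minimal per-node resource sampler (illustrative sketch only).
    # Assumes the third-party psutil package is installed.
    import socket
    import time

    import psutil

    def sample_node():
        """Return one snapshot of the key resource counters for this node."""
        vm = psutil.virtual_memory()
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        return {
            "host": socket.gethostname(),
            "timestamp": time.time(),
            "cpu_pct": psutil.cpu_percent(interval=1),  # 1-second CPU sample
            "mem_pct": vm.percent,
            "disk_read_bytes": disk.read_bytes,    # cumulative counters: diff
            "disk_write_bytes": disk.write_bytes,  # successive samples to get
            "net_sent_bytes": net.bytes_sent,      # throughput rates
            "net_recv_bytes": net.bytes_recv,
        }

    if __name__ == "__main__":
        while True:
            print(sample_node())  # a real collector would write to the CMIS
            time.sleep(60)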

(2)  What are the trends for the important metrics?  Big Data environments typically deal with datasets that are increasing in size – and as those datasets grow, the amount of processing done on them tends to grow as well.  The Capacity Manager must keep a close eye out – a healthy cluster today could be one with severe performance bottlenecks tomorrow.  Trend alerting is built into athene® and can warn the Capacity Manager that performance thresholds will be hit in the future, allowing ample time to plan changes to the environment to handle the predicted increase in load.
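
One simple way to picture trend alerting is a straight-line fit over recent utilisation history, extrapolated to the point where a threshold would be crossed.  The sketch below illustrates only that general idea (it is not athene®'s algorithm); the sample history and the 80% threshold are assumptions.

    # Estimate how many days remain until a linearly trending metric
    # crosses a threshold (illustrative sketch only).
    import numpy as np

    def days_until_threshold(daily_util_pct, threshold_pct=80.0):
        """Days until the fitted trend reaches the threshold (0 if already
        there), or None if the trend is flat or decreasing."""
        days = np.arange(len(daily_util_pct))
        slope, intercept = np.polyfit(days, daily_util_pct, 1)  # least squares
        if slope <= 0:
            return None
        crossing_day = (threshold_pct - intercept) / slope
        return max(crossing_day - days[-1], 0.0)

    # Example: CPU utilisation creeping up by ~0.5% per day from 60%
    history = [60 + 0.5 * d for d in range(30)]
    print(days_until_threshold(history))  # roughly 11 days until 80% is hit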

(3)  Storage space is certainly something that cannot be forgotten.  With DAS, data is distributed and replicated across many nodes.  It's important to be able to take the storage space available across a Hadoop cluster and represent it in a way that quickly shows how much headroom is available at a given time and how disk space usage trends over time.  athene® can easily aggregate the directly attached storage disks to give a big-picture view of the disk space available as well as the amount of headroom, and these reports can also show how disk space usage changes over time.  Trend reports and alerting can quickly warn the Capacity Manager when free storage is running low.
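
As a rough illustration of that aggregation, the sketch below rolls per-node DAS figures up into a single cluster-wide headroom view; the node names and capacities are made up.

    # Aggregate per-node DAS usage into one cluster-wide headroom figure
    # (illustrative sketch; the node data is made up).
    def cluster_headroom(nodes):
        """nodes: list of dicts holding per-node capacity and usage in TB."""
        capacity = sum(n["capacity_tb"] for n in nodes)
        used = sum(n["used_tb"] for n in nodes)
        free = capacity - used
        return {
            "capacity_tb": capacity,
            "used_tb": used,
            "free_tb": free,
            "headroom_pct": round(100.0 * free / capacity, 1),
        }

    nodes = [
        {"host": "datanode01", "capacity_tb": 48, "used_tb": 31},
        {"host": "datanode02", "capacity_tb": 48, "used_tb": 29},
        {"host": "datanode03", "capacity_tb": 48, "used_tb": 35},
    ]
    print(cluster_headroom(nodes))
    # {'capacity_tb': 144, 'used_tb': 95, 'free_tb': 49, 'headroom_pct': 34.0}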

(4)  Finally, there's the need to size and predict necessary changes to the environment as time goes on.  As with any other environment, a shortage in one key subsystem can affect the entire environment.  The ability to model future system requirements based on business needs and other organic growth is vital for the Capacity Manager.  With athene®, it's easy to see how trends are affecting future needs and it's equally easy to model expected requirements based on predicted changes to the workload.
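
A very basic what-if model along those lines might project today's storage footprint forward under an assumed business growth rate and flag the month in which current capacity runs out; the growth rate and capacity figures below are illustrative only.

    # Flag the first month an assumed growth rate exhausts current capacity
    # (illustrative sketch; all figures are assumptions).
    def first_month_over_capacity(current_tb, capacity_tb, monthly_growth,
                                  horizon=24):
        """Return the first month the projected footprint exceeds capacity,
        or None if it stays within capacity over the horizon."""
        footprint = current_tb
        for month in range(1, horizon + 1):
            footprint *= (1 + monthly_growth)
            if footprint > capacity_tb:
                return month
        return None

    # Example: 95 TB stored today, 144 TB of capacity, 5% growth per month
    print(first_month_over_capacity(95, 144, 0.05))  # month 9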

As the price of data storage continues to decrease and the amount of data continues to increase, it becomes even more vital that organizations with Big Data implementations closely manage and monitor their environments to ensure that service levels are met and adequate capacity is always available. 
We'll be taking a look at some of the typical metrics you should be monitoring in our Capacity Management and Big Data webinar (Part 2) tomorrow - register now, and don't worry if you missed Part 1: join our Community and catch it on demand.
 
Rich Fronheiser
Chief Marketing Officer
 

Monday 23 September 2013

Capacity Management of Big Data - Technology (2/3)

A technology for Big Data – Aptly Named for an Elephant

There are many different commercial applications on the marketplace for Big Data control and management.  Many of them are built on top of an open source technology known as Apache Hadoop, named after the toy elephant of one of the developer’s children.  Hadoop may seem new to many who are hearing about it for the first time, but the framework was developed over the course of the last 8-10 years.  Yahoo! claimed the largest (at the time) production rollout of Hadoop in 2008 when its Web Search platform was built using the technology.

Since then, Hadoop (and its associated Hadoop Distributed File System (HDFS), which we'll look at a bit later) has been implemented at many other companies that are dealing with Big Data and the storage and processing challenges that go with it.  The success of many of these companies' businesses hinges on the successful handling of Big Data.  These include Amazon.com, Google, LinkedIn, Microsoft, Twitter, and Facebook.  Facebook's cluster was announced in June 2012 to be 100 petabytes, and by November 2012 it was growing by 500 terabytes per day.
Big Data – how do we store all of it?

Let’s start with storage.  Storage cost, performance, and reliability are crucial when it comes to any Big Data implementation.

This seems obvious, but there are some interesting observations that can help the Capacity Manager understand why Hadoop and the HDFS are architected the way they are.
First, the HDFS that underpins Hadoop and most Big Data implementations is designed so that data is distributed across multiple nodes and replicated multiple times (typically at least 3) to account for node failure and help provide high levels of performance.  Unlike most storage implementations today, Hadoop’s HDFS is open source and traditionally uses direct attached storage (DAS), so an organization can implement with no licensing or support costs.

Also, HDFS has incredibly high bandwidth – the distributed nature of the data combined with the use of existing low cost, shared networks means that in a typical Hadoop implementation, clusters can read/write more than a terabyte of data per second continuously to the processing layer.
Finally, the replication of the data multiple times across nodes using DAS means that it’s unlikely there will be availability issues due to a node failure.  HDFS has been used successfully in many different sized organizations with many different sizes of cluster.

HDFS was designed to be inexpensive, perform extremely well, and be very reliable.
But as the Capacity Manager, surely you need to be able to properly size storage for a Hadoop environment.  Let’s say you have 100 TB of data that you need to store.  Keep in mind that Hadoop’s default replication is 3 times, so you’d need to have 300 TB just for the user data involved.  It’s also necessary to have an additional amount of disk space equal to the data requirement for temporary files and processing space.  So that 100 TB requirement is now 400 TB.
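
That arithmetic is easy to wrap in a small helper, sketched below; the default replication factor of 3 and the assumption of an equal amount again for temporary and processing space mirror the example above.

    # Rough Hadoop storage sizing: replicas plus temporary/processing space
    # (a sketch of the arithmetic in the text, not a vendor formula).
    def hadoop_storage_required(raw_tb, replication=3, temp_factor=1.0):
        """Return total disk (TB) to provision for raw_tb of user data."""
        replicated = raw_tb * replication   # 100 TB -> 300 TB at 3x replication
        temp_space = raw_tb * temp_factor   # temporary files and processing space
        return replicated + temp_space

    print(hadoop_storage_required(100))  # 400.0 TB to provision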

Hadoop data can be compressed, which will reduce some of the disk space requirements (and will increase the CPU cycles needed to a certain degree).  However, it’s quite a bit easier to simply size without considering compression, at least initially until you decide what data you will compress and how that will reclaim some of the disk space you’ve allocated.
And of course, you’ll need to consider that it’s quite possible that your data requirements will increase fairly rapidly if you intend to retain a lot of this data – and as the capacity manager you need to stay in front of that requirement.

Other “typical” Capacity Management factors are important, too

Speed of data processing is very important in a Big Data environment, so the number of nodes in your cluster will not necessarily be determined by the disk space provided by the DAS.  In fact, it's possible that you'll configure more storage than "necessary" in order to provide adequate CPU processing power for you to meet your end users' requirements.
Likewise, network throughput and bandwidth are frequently overlooked in today's high-bandwidth, high-throughput environments, but because the nodes in a Hadoop cluster work by communicating frequently with one another, it's vital that they be connected in such a way that there's low latency and high throughput.  Network traffic and congestion tend to depend on the types of work being done and are usually limited by the slowest devices involved.

Hadoop and its associated tasks require a lot of memory.  Large regions of data are kept in memory (ideally a whole table would be kept in memory, but that's impossible with tables that are frequently over a terabyte in size).  The processing itself requires gigabytes of memory per task.  It's important that organizations not try to skimp on memory in the nodes used to set up clusters – doing so will slow down processing dramatically.
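
One way to picture how these factors interact is to size the cluster separately by storage, by CPU (concurrent tasks per core) and by memory per task, and then take the largest answer.  The sketch below does exactly that; every per-node figure and workload number in it is an assumption for illustration, not a recommendation.

    # Size a cluster by storage, CPU and memory separately, then take the
    # largest answer (illustrative figures only).
    import math

    node = {"usable_tb": 36.0, "cores": 16, "ram_gb": 128}
    workload = {
        "storage_needed_tb": 400.0,  # e.g. 100 TB x3 replication + temp space
        "peak_tasks": 600,           # concurrent map/reduce tasks at peak
        "tasks_per_core": 2,         # rule-of-thumb oversubscription
        "ram_per_task_gb": 4.0,
    }

    by_storage = math.ceil(workload["storage_needed_tb"] / node["usable_tb"])
    by_cpu = math.ceil(workload["peak_tasks"]
                       / (node["cores"] * workload["tasks_per_core"]))
    by_memory = math.ceil(workload["peak_tasks"] * workload["ram_per_task_gb"]
                          / node["ram_gb"])

    print(f"by storage: {by_storage}, by CPU: {by_cpu}, by memory: {by_memory}")
    print("nodes required:", max(by_storage, by_cpu, by_memory))
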
Register now for Capacity Management for Big Data Part 2, taking place this Thursday 26th, and don't worry if you missed Part 1 - you can catch up with our on-demand webinars.
On Wednesday I'll conclude with managing and monitoring a Hadoop Cluster.
Rich Fronheiser
Chief Marketing Officer 

Thursday 19 September 2013

Big Data and Capacity Management (1/3)

Big Data has received Big Attention from a lot of different people, and not only in the IT world.  But why?  Let’s examine what Big Data is first and then talk about why it’s important that Capacity Managers carefully consider the implications of Big Data.

What is Big Data?

Big Data typically refers to large sets of data that (choose one or many):

(1)  Require processing power that exceeds what you’d get with a legacy database

(2)  Are too big to be handled by one system in a traditional way (*)

(3)  Arrive much too quickly or in too great a volume to be handled in the traditional way

(4)  Are of such varied types that it makes little sense to try to store or handle them in a traditional way

(*) By "traditional way" I mean handled by a typical relational database.  Even there, the definition of "typical" is somewhat elusive due to the complexities of today's databases.
Why are we concerned about doing things in a new way?

(1)  90% of all data stored today was created in the last 2 years
Frightening concept, isn't it?  Imagine if this trend continues over the next decade.  Storage costs have dropped consistently – statistics I've seen suggest that the cost per gigabyte of storage has halved just about annually over the last decade.  And yet, it doesn't take an advanced math degree to see that companies will need to make considerable investments in storage space to keep up with the demands of Big Data.  Another estimate, from IBM, is that over 2.5 exabytes (an exabyte is a million terabytes, or a billion gigabytes) of data is created every day.

OK, but what *is* Big Data?

Big data generally falls into 3 categories – the traditional enterprise data we’re all used to, automatically generated data (from call logs, smart meters, facility/manufacturing sensors, and GPS data, among others), and social data that can include valuable customer feedback from sites like Facebook and Twitter, among others.
Companies have realized in the past few years that these last two types – non-traditional data – can be mined for incredibly useful information.  However, this data is not well structured and is too voluminous to be handled in traditional ways.  In the past, the technology to mine this data either didn't exist or was prohibitively expensive and didn't provide a clear return on investment.  Back then, this data would likely have been thrown out, or ignored and allowed to be deleted once it reached a certain age, unlikely to receive a second (or even a first) look.

Not only is the data valuable, but much of it is perishable, meaning it’s crucial that the data be used almost immediately or the value of the data decreases dramatically or even ceases to exist.

What’s changed?
Companies are starting to analyze and look at this data more closely now.  Many things have changed to make this happen – I’ll list the top four, in my estimation:
(1)  Storage costs have decreased dramatically.  As I mentioned earlier, the cost of disk has halved annually for about the last decade.  Whereas gigabytes and terabytes of storage were extremely expensive and closely controlled 5-10 years ago, the cost to store additional data (such as Big Data) has now dropped enough to make it far more affordable.
(2)  Processing / Computation costs have decreased dramatically.  Combined with new technologies that allow processing to be distributed across many (relatively) cheap nodes (see #3, below), it’s extremely easy to put processing power in place to manipulate and analyze these huge, varied sets of data.  And do it quickly, while the data still has value to the business.
(3)  New technologies have enabled the transformation of huge sets of data into valuable business information.  One of the leaders in the processing of Big Data is an open source technology called Apache Hadoop.  
(4)  Social Media outlets such as Facebook, Twitter, and LinkedIn that allow customers to quickly, easily, and sometimes semi-anonymously provide feedback directly to a company.  The company doesn’t necessarily control the location where the feedback is given, so it’s crucial that the company is able to quickly and easily mine for relevant data.  This is especially true when it comes to sites like Twitter, where data is especially perishable and trends can come and go very quickly.
Dale's webinar series on Big Data and the implications for Capacity Management is worth attending.  If you missed Part 1, join our Community and listen to it now http://www.metron-athene.com/_downloads/on-demand-webinars/index.html and don't forget to sign up for Capacity Management for Big Data Part 2 http://www.metron-athene.com/services/training/webinars/index.html

Join me again on Monday when I'll be taking a look at a technology for Big Data.
Rich Fronheiser
Chief Marketing Officer

Tuesday 17 September 2013

Big Data Overview - MapReduce Processing (4/4)


In the final part of my series I’ll talk about MapReduce processing. 

Distributing workload through MapReduce processing is a key factor in Big Data scalability. 
 
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease the transmission of data. The first complete end-to-end framework for MapReduce on top of Apache Hadoop was done within the Advanced Research Group of Unisys under Dr. Sumeet Malhotra [4].

 
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. 
 
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
 
MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time, or if the reduction function is associative.
While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
 
Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the K1 input key value each processor would work on, and provides that processor with all the input data associated with that key value 
 
Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2. 
 
"Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor would work on, and provides that processor with all the Map-generated data associated with that key value.
 
Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
 
Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
 
Logically these 5 steps can be thought of as running in sequence – each step starts only after the previous step is completed – though in practice, of course, they can be intertwined, as long as the final result is not affected.
In many situations the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as much as possible local to the Map-generated data they need to process.
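
To make those five steps concrete, here is a toy, single-process word count in Python that mirrors them.  In a real Hadoop job the map and reduce functions would run in parallel across the cluster; the input documents here are made up.

    # Toy word count following the five MapReduce steps in a single process.
    from collections import defaultdict

    documents = {"doc1": "big data big clusters", "doc2": "big data everywhere"}

    # 1. Prepare the Map() input: one (K1, value) pair per document
    map_input = documents.items()

    # 2. Run Map() once per K1 value, emitting intermediate (K2, value) pairs
    def map_fn(doc_id, text):
        return [(word, 1) for word in text.split()]

    intermediate = []
    for k1, value in map_input:
        intermediate.extend(map_fn(k1, value))

    # 3. "Shuffle": group all intermediate values that share the same K2 key
    groups = defaultdict(list)
    for k2, v in intermediate:
        groups[k2].append(v)

    # 4. Run Reduce() once per K2 key
    def reduce_fn(word, counts):
        return word, sum(counts)

    # 5. Collect the Reduce output and sort it by K2 for the final result
    result = sorted(reduce_fn(k2, vs) for k2, vs in groups.items())
    print(result)  # [('big', 3), ('clusters', 1), ('data', 2), ('everywhere', 1)]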
 
In this series I've given you an 'overview' of Big Data, and I'll be looking at the implications of Big Data for Capacity Management, part 2 of my Big Data series, on Wednesday 26 in a live webinar.
 
If you missed Part 1 don’t worry, register for our Community and listen to it now http://www.metron-athene.com/_downloads/on-demand-webinars/index.html
 
Dale Feiste
Principal Consultant

 

Friday 13 September 2013

Big Data Overview - The Big Data Threshold (3/4)


Four properties of data storage and retrieval can be evaluated to help determine when Big Data technology may be useful.
 
 
 
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.

·         Turn 12 terabytes of Tweets created each day into improved product sentiment analysis

·         Convert 350 billion annual meter readings to better predict power consumption

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.

·         Scrutinize 5 million trade events created each day to identify potential fraud

·         Analyze 500 million daily call detail records in real time to predict customer churn faster

Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

·         Monitor hundreds of live video feeds from surveillance cameras to target points of interest

·         Exploit the 80% data growth in images, video and documents to improve customer satisfaction

Veracity: 1 in 3 business leaders don't trust the information they use to make decisions. How can you act upon information if you don't trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grows.

Next Tuesday I'll discuss MapReduce processing, and on Wednesday 26 I'll be broadcasting part 2 of my Big Data series live - register for this event http://www.metron-athene.com/services/training/webinars/index.html

If you missed Part 1 register for our Community and listen to it now http://www.metron-athene.com/_downloads/on-demand-webinars/index.html

Dale Feiste
Principal Consultant



Wednesday 11 September 2013

Big Data Overview - Where does Big Data exist? (2/4)

With exponential data growth everywhere, Big Data is showing up in all businesses, including manufacturing, as well as in social networks, search engines, photography archives, government, e-commerce, and many areas of science.




What’s in it for manufacturers?

Manufacturing supply chain managers dream of a quick-response, demand-based, integrated, multi-level supply network. That would entail tracking every SKU’s status and pathways from raw material through production, distribution, transportation, and delivery of the final product. 

That dream is becoming a reality.

Now it’s possible to constantly monitor a flow of data, making exception-handling faster and more effective. In addition, the ability to monitor factors like defect ratios and on-time delivery can help with supplier selection and performance assessment.

In manufacturing operations, synchronizing demand forecasting with production planning continues to be a struggle. Since actual orders will always differ from predicted demand, it’s a matter of having time to respond. With transparent multi-level integrated supply chain data, there’s better advance notice for adjusting plans and material supply.

Inside the factory, the ability to use the mass of both order and machine status data allows production managers to better optimize operations, factory scheduling, routings, production levelling, maintenance planning, and workforce scheduling and deployment.

Catch up with me again on Friday when I’ll take a look at the Big Data threshold.

In the meantime, sign up for my free webinar series Capacity Management for Big Data, starting tomorrow, September 12 http://www.metron-athene.com/services/training/webinars/index.html

Dale Feiste
Principal Consultant


Monday 9 September 2013

Big Data Overview (1 of 4)

I’ll be running a webinar on September 12 looking at the impact of Big Data from a capacity management perspective and so I thought it would be good to share an overview of Big Data with you, starting today with the terminology used and what it means.

Jargon related to Big Data is new to many people in IT and the list below explains the more common terms you may see.

 
Hadoop
An open source software library project administered by the Apache Software Foundation. Apache defines Hadoop as “a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.”
HDFS
The Hadoop Distributed File System. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of the larger data sets, and this provides the scalability that is needed for big data processing.
HBase
A distributed, columnar NoSQL database, often referred to as the Hadoop database.
Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Map/Reduce
A general term that refers to the process of breaking up a problem into pieces that are then distributed across multiple computers on the same network or cluster, or across a grid of disparate and possibly geographically separated systems (map), and then collecting all the results and combining them into a report (reduce). Google's branded framework to perform this function is called MapReduce.
Mashup
The process of combining different datasets within a single application to enhance output, for example, combining demographic data with real estate listings.
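
As a tiny illustration of that idea, the sketch below enriches some made-up real estate listings with made-up demographic data by joining on postcode.

    # Join two small datasets on a shared key - a minimal "mashup" sketch.
    demographics = {
        "90210": {"median_income": 110000, "population": 21000},
        "10001": {"median_income": 82000, "population": 25000},
    }
    listings = [
        {"id": 1, "postcode": "90210", "price": 950000},
        {"id": 2, "postcode": "10001", "price": 640000},
    ]

    mashup = [{**listing, **demographics.get(listing["postcode"], {})}
              for listing in listings]
    for row in mashup:
        print(row)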

 

On Wednesday I’ll be looking at where Big Data exists, in the meantime don’t forget to register for my free webinar http://www.metron-athene.com/services/training/webinars/index.html
Dale Feiste
Principal Consultant