Monday, 23 September 2013

Capacity Management of Big Data - Technology (2/3)

A Technology for Big Data – Aptly Named for an Elephant

There are many different commercial applications in the marketplace for Big Data control and management.  Many of them are built on top of an open source technology known as Apache Hadoop, named after a toy elephant belonging to the child of one of its developers.  Hadoop may seem new to many who are hearing about it for the first time, but the framework was developed over the course of the last 8-10 years.  Yahoo! claimed the largest (at the time) production rollout of Hadoop in 2008, when its Web Search platform was built using the technology.

Since then, Hadoop (and its associated Hadoop Distributed File System (HDFS), which we'll look at a bit later) has been implemented at many other companies that are dealing with Big Data and the storage and processing challenges that go with it.  The success of many of these companies' businesses hinges on the successful handling of Big Data.  They include Amazon.com, Google, LinkedIn, Microsoft, Twitter, and Facebook.  In June 2012, Facebook announced that its cluster held 100 petabytes, and by November 2012 it was growing by 500 terabytes per day.

Big Data – how do we store all of it?

Let’s start with storage.  Storage cost, performance, and reliability are crucial when it comes to any Big Data implementation.

This seems obvious, but there are some interesting observations that can help the Capacity Manager understand why Hadoop and the HDFS are architected the way they are.
First, the HDFS that underpins Hadoop and most Big Data implementations is designed so that data is distributed across multiple nodes and replicated multiple times (typically at least 3) to account for node failure and to help provide high levels of performance.  Unlike most storage implementations today, HDFS is open source and traditionally uses direct attached storage (DAS), so an organization can implement it with no licensing or support costs.

Also, HDFS has incredibly high aggregate bandwidth – the distributed nature of the data, combined with the use of existing low-cost, shared networks, means that a sizeable Hadoop cluster can read and write more than a terabyte of data per second, continuously, to the processing layer.
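As a rough, back-of-the-envelope illustration of where that kind of aggregate figure comes from (the node count, disks per node, and per-disk throughput used here are assumptions for illustration, not figures from any particular cluster):

    # Back-of-the-envelope aggregate read bandwidth for a cluster using DAS.
    # All of these figures are illustrative assumptions, not measurements.
    nodes = 4000                   # data nodes in the cluster (assumed)
    disks_per_node = 12            # directly attached disks per node (assumed)
    mb_per_sec_per_disk = 100      # sustained sequential read per disk, MB/s (assumed)

    aggregate_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk
    print(f"Aggregate read bandwidth: ~{aggregate_mb_per_sec / 1_000_000:.1f} TB/s")

Because every node reads from its own local disks in parallel, the total throughput grows with the cluster rather than being capped by a central storage array.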
Finally, because the data is replicated multiple times across nodes using DAS, it's unlikely that a single node failure will cause availability issues.  HDFS has been used successfully in organizations of many different sizes, with clusters of many different sizes.

HDFS was designed to be inexpensive, perform extremely well, and be very reliable.
But as the Capacity Manager, surely you need to be able to properly size storage for a Hadoop environment.  Let's say you have 100 TB of data that you need to store.  Keep in mind that Hadoop's default replication factor is 3, so you'd need 300 TB just for the user data involved.  It's also necessary to have an additional amount of disk space, equal to the data requirement, for temporary files and processing space.  So that 100 TB requirement is now 400 TB.
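A minimal sketch of that calculation, assuming the default replication factor of 3 and temporary/processing space equal to the raw data requirement as described above, might look like this:

    # Rough HDFS storage sizing: replicated user data plus temporary/processing space.
    user_data_tb = 100
    replication_factor = 3          # Hadoop's default replication factor
    temp_space_tb = user_data_tb    # temp/processing space equal to the data requirement

    total_tb = user_data_tb * replication_factor + temp_space_tb
    print(f"Raw disk needed for {user_data_tb} TB of user data: {total_tb} TB")  # 400 TB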

Hadoop data can be compressed, which will reduce some of the disk space requirements (and will increase the CPU cycles needed to a certain degree).  However, it's quite a bit easier to simply size without considering compression, at least initially, until you decide which data you will compress and how much of the allocated disk space that will reclaim.
And of course, you'll need to consider that your data requirements are quite likely to grow fairly rapidly if you intend to retain much of this data – and as the Capacity Manager you need to stay in front of that requirement.
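One simple way to stay in front of it is to project the footprint forward from an assumed growth rate; the sketch below uses a hypothetical 10% monthly growth figure purely for illustration:

    # Project the raw disk footprint forward under an assumed growth rate.
    # The 10% monthly growth figure is hypothetical; substitute your own measured trend.
    user_data_tb = 100
    monthly_growth = 0.10
    replication_and_temp = 4        # 3x replication plus equal temp space, as sized above

    for months in (6, 12, 24):
        projected = user_data_tb * (1 + monthly_growth) ** months * replication_and_temp
        print(f"Raw disk needed in {months:2d} months: {projected:,.0f} TB")

Even a modest monthly growth rate compounds quickly over a year or two.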

Other “typical” Capacity Management factors are important, too

Speed of data processing is very important in a Big Data environment, so the number of nodes in your cluster will not necessarily be determined by the disk space provided by the DAS.  In fact, it's quite possible that you'll configure more storage than "necessary" in order to provide adequate CPU processing power to meet your end-users' requirements.
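One way to make that concrete is to take the larger of a storage-driven and a compute-driven node count; the per-node disk, task-slot, and workload figures below are assumptions for illustration only:

    import math

    # Node count driven by whichever constraint is tighter: disk or CPU.
    # Per-node capacities and the workload target are illustrative assumptions.
    raw_disk_needed_tb = 400           # from the storage sizing above
    usable_disk_per_node_tb = 24       # usable DAS per node (assumed)
    concurrent_tasks_needed = 2000     # to meet end-user turnaround targets (assumed)
    task_slots_per_node = 16           # roughly one slot per core (assumed)

    nodes_for_storage = math.ceil(raw_disk_needed_tb / usable_disk_per_node_tb)    # 17
    nodes_for_compute = math.ceil(concurrent_tasks_needed / task_slots_per_node)   # 125

    nodes = max(nodes_for_storage, nodes_for_compute)
    print(f"Cluster size: {nodes} nodes (CPU, not disk, is the constraint here)")

In this example the cluster ends up with far more disk than the data strictly needs, which is exactly the trade-off described above.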
Likewise, network throughput and bandwidth are frequently overlooked in today's high-bandwidth, high-throughput environments, but because the nodes in a Hadoop cluster work by communicating frequently with one another, it's vital that they be connected in a way that provides low latency and high throughput.  Network traffic and congestion tend to depend on the types of work being done and are usually limited by the slowest devices involved.

Hadoop and its associated tasks require a lot of memory.  Large regions of data are kept in memory (ideally the whole table would be, but that's impossible when tables frequently exceed a terabyte in size), and the processing itself requires gigabytes of memory per task.  It's important that organizations not try to skimp on memory in the nodes used to build clusters – skimping here will slow down processing dramatically.
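A simple per-node memory estimate, using assumed figures for task slots, memory per task, and overheads (illustrative numbers, not recommendations), might look like this:

    # Per-node memory estimate for a Hadoop worker node.
    # Task counts and per-task memory below are illustrative assumptions.
    task_slots_per_node = 16
    gb_per_task = 4                 # heap per map/reduce task (assumed)
    os_and_daemons_gb = 8           # OS, DataNode and other daemons (assumed)
    in_memory_regions_gb = 24       # room for in-memory data regions / caching (assumed)

    memory_per_node_gb = (task_slots_per_node * gb_per_task
                          + os_and_daemons_gb + in_memory_regions_gb)
    print(f"Memory per node: {memory_per_node_gb} GB")   # 96 GB

If the number of concurrent tasks or the memory per task rises, the per-node requirement rises with it, so size with headroom.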
Register now for Capacity Management of Big Data Part 2, taking place this Thursday 26th.  If you missed Part 1, don't worry: you can catch up with our on-demand webinars.
On Wednesday I'll conclude with managing and monitoring a Hadoop Cluster.
Rich Fronheiser
Chief Marketing Officer 
