A technology for Big Data – Aptly Named for an Elephant
There are many different commercial applications on the market for Big Data control and management. Many of them are built on top of an open source technology known as Apache Hadoop, named after a toy elephant belonging to the son of one of its developers. Hadoop may seem new to many who are hearing about it for the first time, but the framework has been developed over the course of the last 8-10 years. Yahoo! claimed the largest (at the time) production rollout of Hadoop in 2008, when its Web Search platform was built using the technology.
Since then, Hadoop (and its associated Hadoop Distributed
File System (HDFS), which we’ll look at a bit later) has been implemented at
many other companies that are dealing with Big Data and the storage and
processing challenges that go with it.
The success of many of these companies’ businesses hinges on successful handling of Big Data. These include Amazon.com, Google, LinkedIn, Microsoft, Twitter, and Facebook. Facebook’s cluster was announced in June 2012 to be 100 petabytes, and by November 2012 it was growing by 500 terabytes per day.
Big Data – how do we store all of it?
Let’s start with storage. Storage cost, performance, and reliability are crucial when it comes to any Big Data implementation.
This seems obvious, but there are some interesting
observations that can help the Capacity Manager understand why Hadoop and the
HDFS are architected the way they are.
First, the HDFS that underpins Hadoop and most Big Data implementations is designed so that data is distributed across multiple nodes and replicated multiple times (three copies by default) to account for node failure and to help provide high levels of performance. Unlike most storage implementations today, Hadoop’s HDFS is open source and traditionally uses direct attached storage (DAS), so an organization can implement it with no licensing or support costs.
Also, HDFS has incredibly high aggregate bandwidth – the distributed nature of the data, combined with the use of existing low-cost, shared networks, means that in a typical Hadoop implementation clusters can read and write more than a terabyte of data per second, continuously, to the processing layer.
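As a rough illustration of where a figure like that comes from, aggregate bandwidth is essentially per-node disk throughput multiplied by the number of nodes reading in parallel. The numbers below are assumptions chosen purely for illustration, not measurements from any particular cluster:

    # Back-of-envelope aggregate HDFS read bandwidth (all figures are assumptions)
    nodes = 1000                  # assumed cluster size
    disks_per_node = 4            # assumed DAS spindles per node
    mb_per_sec_per_disk = 300     # assumed sequential read rate per disk, in MB/s

    aggregate_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk
    print(f"Aggregate read bandwidth: {aggregate_mb_per_sec / 1_000_000:.1f} TB/s")  # ~1.2 TB/s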
Finally, the replication of the data multiple times across
nodes using DAS means that it’s unlikely there will be availability issues due
to a node failure. HDFS has been used successfully in organizations of all sizes, running clusters both large and small.
HDFS was designed to be inexpensive, perform extremely well,
and be very reliable.
But as the Capacity Manager, surely you need to be able to
properly size storage for a Hadoop environment.
Let’s say you have 100 TB of data that you need to store. Keep in mind that Hadoop’s default replication factor is 3, so you’d need to have 300 TB just for the user data
involved. It’s also necessary to have an
additional amount of disk space equal to the data requirement for temporary
files and processing space. So that 100
TB requirement is now 400 TB.
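A quick back-of-envelope calculation in Python makes that arithmetic explicit. It is only a sketch; the assumption that temporary and processing space equals the user data volume is the rule of thumb used above, and your own workloads may need more or less:

    # Rough HDFS raw-capacity sizing (rule-of-thumb sketch, not a formal model)
    user_data_tb = 100         # data you actually need to store
    replication_factor = 3     # HDFS default replication
    temp_space_factor = 1.0    # temporary/processing space as a fraction of user data

    raw_capacity_tb = user_data_tb * (replication_factor + temp_space_factor)
    print(f"Raw disk required: {raw_capacity_tb:.0f} TB")  # 100 * (3 + 1) = 400 TB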
Hadoop data can be compressed, which will reduce some of the
disk space requirements (and will increase the CPU cycles needed to a certain
degree). However, it’s quite a bit
easier to simply size without considering compression, at least initially until
you decide what data you will compress and how that will reclaim some of the
disk space you’ve allocated.
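If and when you do decide what to compress, the same sketch can be refined. The compressible fraction and compression ratio below are assumed values for illustration only:

    # Refining the sizing sketch with an assumed compression ratio
    user_data_tb = 100
    compressed_fraction = 0.5    # assumed share of the data you choose to compress
    compression_ratio = 0.4      # assumed compressed size as a fraction of original size
    replication_plus_temp = 4    # replication factor 3 plus equal temporary space, as above

    effective_tb = (user_data_tb * (1 - compressed_fraction)
                    + user_data_tb * compressed_fraction * compression_ratio)
    print(f"Raw disk with compression: {effective_tb * replication_plus_temp:.0f} TB")  # 280 TB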
And of course, you’ll need to consider that it’s quite possible that your data requirements will increase fairly rapidly if you intend to retain a lot of this data – and as the capacity manager you need to stay in front of that requirement.
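A simple projection is one way to stay in front of it. The daily growth rate below is an assumed figure purely for illustration; substitute the trend you actually measure:

    # Linear growth projection for raw HDFS capacity (assumed growth rate)
    raw_capacity_tb = 400        # today's requirement, from the sizing above
    daily_growth_tb = 1.5        # assumed net new user data per day
    replication_plus_temp = 4    # replication factor 3 plus temporary space allowance

    for months in (3, 6, 12):
        projected = raw_capacity_tb + daily_growth_tb * replication_plus_temp * 30 * months
        print(f"Raw capacity needed in {months:2d} months: {projected:.0f} TB")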
Other “typical” Capacity Management factors are important, too
Speed of data processing is very important in a Big Data environment, so the number of nodes in your cluster will not necessarily be determined by the disk space provided by the DAS. In fact, it’s possible that you’ll configure more storage than “necessary” in order to provide adequate CPU processing power to meet your end users’ requirements.
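Put another way, the node count is often the larger of what storage demands and what processing demands. The per-node figures below are assumptions for illustration, not recommendations:

    # Node count driven by compute rather than disk (illustrative assumptions)
    raw_capacity_tb = 400             # from the sizing above
    usable_tb_per_node = 24           # assumed DAS per node after OS and overhead
    concurrent_tasks_needed = 2000    # assumed peak concurrent tasks to meet user demand
    tasks_per_node = 12               # assumed task slots (cores) per node

    nodes_for_storage = -(-raw_capacity_tb // usable_tb_per_node)        # ceiling division
    nodes_for_compute = -(-concurrent_tasks_needed // tasks_per_node)
    print(f"Nodes needed: {max(nodes_for_storage, nodes_for_compute)}")  # compute dominates here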
Likewise, network throughput and bandwidth are frequently overlooked in today’s high-bandwidth, high-throughput environments, but because the nodes in a Hadoop cluster work by communicating frequently with one another, it’s vital that they be connected in such a way that there’s low latency and high throughput. Network traffic and congestion tend to depend on the types of work being done and are usually limited by the slowest devices involved.
Hadoop and its associated tasks require a lot of memory. Large regions of data are kept in memory (ideally the whole table would be, but that’s impossible with tables that are frequently over a terabyte in size). The processing itself requires gigabytes of memory per task. It’s important that organizations not try to skimp on memory in the nodes used to set up clusters – doing so will slow down processing dramatically.
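A per-node memory estimate follows the same pattern. Every figure below is an assumption for illustration; real task counts and per-task memory depend entirely on your workload:

    # Rough per-node memory sizing (all figures are assumptions)
    tasks_per_node = 12        # assumed concurrent tasks per node
    gb_per_task = 4            # assumed memory per task
    os_and_daemons_gb = 8      # assumed OS plus Hadoop daemon overhead
    in_memory_data_gb = 32     # assumed space for data regions held in memory

    node_memory_gb = tasks_per_node * gb_per_task + os_and_daemons_gb + in_memory_data_gb
    print(f"Memory per node: {node_memory_gb} GB")  # 48 + 8 + 32 = 88 GB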
Register now for Capacity Management Big Data Part 2, taking place this Thursday 26th. If you missed Part 1, don’t worry – you can catch up with our on-demand webinars.
On Wednesday I'll conclude with managing and monitoring a Hadoop Cluster.
Rich Fronheiser
Chief Marketing Officer