Friday 2 January 2015

Big Data Concerns - Top 5 Key Capacity Management Concerns for UNIX/Linux (9 of 12)


So what is Big Data? 
Big Data - Data sets whose size has grown beyond the management capabilities of traditional software tools.

Vast amounts of information - Enormous volumes of data are now generated and stored, particularly by social media applications such as Facebook and Twitter.  A Big Data solution therefore needs to support petabytes, and even exabytes, of data.
Hadoop (HDFS) - One such solution is Apache's Hadoop offering.  Hadoop is an open source software library project administered by the Apache Software Foundation.  Apache defines Hadoop as “a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.”  Using HDFS (the Hadoop Distributed File System), data in a Hadoop cluster is broken down into smaller pieces, called blocks, and distributed throughout the cluster with automatic replication. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, which provides the scalability needed for big data processing.
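To make the block and replication idea concrete, here is a minimal sketch using Hadoop's Java FileSystem API that reports how a single HDFS file is split into blocks and which DataNodes hold the replicas.  The file path /data/sample/events.log is an illustrative assumption, as is the expectation that the cluster's core-site.xml and hdfs-site.xml are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockReport {
    public static void main(String[] args) throws Exception {
        // Assumes the cluster configuration files are on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path - replace with a file that exists in your cluster.
        Path file = new Path("/data/sample/events.log");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("File length (bytes): " + status.getLen());
        System.out.println("Block size (bytes) : " + status.getBlockSize());
        System.out.println("Replication factor : " + status.getReplication());

        // Each block is stored on several DataNodes (automatic replication),
        // which is what lets map tasks run close to the data they process.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i + " hosts: "
                    + String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}

From a capacity management point of view, the key numbers are the block size (typically 64 MB or 128 MB depending on the Hadoop version) and the replication factor (3 by default), since every byte ingested into HDFS then consumes roughly three times that amount of raw disk across the cluster.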

Map/Reduce - A general term that refers to the process of breaking up a problem into pieces that are then distributed across multiple computers on the same network or cluster, or across a grid of disparate and possibly geographically separated systems (map), and then collecting all the results and combining them into a report (reduce). Google’s branded framework to perform this function is called MapReduce.
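As a sketch of the programming model, the code below is the classic Hadoop word-count job: the map function emits a (word, 1) pair for every word in its portion of the input, and the reduce function sums the counts for each word.  The class names and the input/output paths passed on the command line are placeholders, not anything prescribed by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: collect all counts for a word and sum them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this as a JAR and submit it with something like hadoop jar wordcount.jar WordCount /input /output, with each map task scheduled, where possible, on a node that already holds a replica of its input block.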
Not recommended for SAN/NAS - Because the data is distributed and replicated across multiple computers, whether on the same network or geographically separated, SAN or NAS storage is not recommended; local block storage on each node should be used instead, keeping the data close to the processing.

So what should we be monitoring? I'll deal with this next... Happy New Year to you all.

Jamie Baker
Principal Consultant
