Tuesday 17 September 2013

Big Data Overview - MapReduce Processing (4/4)


In the final part of my series I’ll talk about MapReduce processing. 

Distributing workload through MapReduce processing is a key factor in Big Data scalability. 
 
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease transmission of data. The first complete end-to-end framework for MapReduce on top of Apache Hadoop was developed within the Advanced Research Group of Unisys under Dr. Sumeet Malhotra [4].

 
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. 
 
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
 
MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative.
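The associativity condition is easy to see with addition as the reduce function: partial results can be combined in any order without changing the answer, which is what lets several reducers share the work for a key. A small Python sketch with made-up values:

# Sketch of why an associative reduce function allows parallel reducers:
# partial reductions combined in any order give the same final result.
from functools import reduce
import operator

values_for_one_key = [4, 1, 7, 3, 5]          # all map outputs sharing one key

all_at_once = reduce(operator.add, values_for_one_key)
left_half   = reduce(operator.add, values_for_one_key[:3])   # reducer A
right_half  = reduce(operator.add, values_for_one_key[3:])   # reducer B
recombined  = left_half + right_half

assert all_at_once == recombined == 20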
While this process can often appear inefficient compared with more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
 
Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the K1 input key value each processor would work on, and provides that processor with all the input data associated with that key value.
 
Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2. 
 
"Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor would work on, and provides that processor with all the Map-generated data associated with that key value.
 
Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
 
Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
 
Logically these 5 steps can be thought of as running in sequence – each step starts only after the previous step is completed – though in practice, of course, they can be intertwined, as long as the final result is not affected.
In many situations the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as much as possible local to the Map-generated data they need to process.
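To tie the five steps together, here is a rough single-process Python sketch of a word-count job, with K1 taken to be the input split id and K2 the word; in a real MapReduce system each step would of course be distributed across many nodes:

# Single-process sketch of the five steps above for a word-count job.
from collections import defaultdict

# Step 1: prepare the Map() input, one (K1, input) pair per split.
splits = {1: "big data on big clusters", 2: "map then reduce the data"}

# Step 2: run the user-provided Map() code once per K1, emitting (K2, value) pairs.
map_output = []
for k1, text in splits.items():
    map_output.extend((word, 1) for word in text.split())

# Step 3: "shuffle" the Map output, grouping it by K2 so each Reduce call
# receives every value produced for its key.
shuffled = defaultdict(list)
for k2, value in map_output:
    shuffled[k2].append(value)

# Step 4: run the user-provided Reduce() code once per K2.
reduced = {k2: sum(values) for k2, values in shuffled.items()}

# Step 5: produce the final output, sorted by K2.
for k2 in sorted(reduced):
    print(k2, reduced[k2])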
 
In this series I’ve given you an overview of Big Data. I’ll be looking at the implications of Big Data for Capacity Management, part 2 of my Big Data series, in a live webinar on Wednesday 26.
 
If you missed Part 1, don’t worry: register for our Community and listen to it now at http://www.metron-athene.com/_downloads/on-demand-webinars/index.html
 
Dale Feiste
Principal Consultant

 
