In the final part of my series I’ll talk about MapReduce processing. Distributing workload through MapReduce processing is a key factor in Big Data scalability.
MapReduce can take advantage of data locality, processing data on or near the storage assets to reduce data transmission. The first complete end-to-end framework for MapReduce on top of Apache Hadoop was developed within the Advanced Research Group of Unisys under Dr. Sumeet Malhotra [4].
"Map" step: The master node takes the input, divides it into smaller sub-problems,
and distributes them to worker nodes. A worker node may do this again in turn,
leading to a multi-level tree structure. The worker node processes the smaller problem, and passes
the answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and
combines them in some way to form the output – the answer to the problem
it was originally trying to solve.
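To make these two steps concrete, here is a minimal word-count sketch in Python; the function names (map_step, reduce_step) and the tiny input are my own illustration, not part of any particular framework.

```python
# A minimal word-count sketch of the "Map" and "Reduce" steps described above.
from collections import defaultdict

def map_step(document):
    # "Map" step: turn one sub-problem (a document) into (word, 1) pairs.
    return [(word, 1) for word in document.split()]

def reduce_step(word, counts):
    # "Reduce" step: combine the per-word answers into a single count.
    return word, sum(counts)

documents = ["big data is big", "map reduce maps data"]

# The master collects the answers from each mapped sub-problem...
mapped = [pair for doc in documents for pair in map_step(doc)]

# ...groups them by word, and combines each group to form the final output.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

result = dict(reduce_step(word, counts) for word, counts in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'map': 1, 'reduce': 1, 'maps': 1}
```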
MapReduce allows for
distributed processing of the map and reduction operations. Provided each
mapping operation is independent of the others, all maps can be performed in
parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative.
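As a rough illustration, the sketch below uses Python's multiprocessing as a stand-in for a cluster: the map over each shard runs in parallel, and because the reduction here (adding counts) is associative, each mapper can pre-combine its own output before the final merge. The names and data are hypothetical.

```python
from multiprocessing import Pool
from collections import Counter

def map_and_combine(shard):
    # Independent map over one shard, with a local (associative) pre-reduction.
    return Counter(shard.split())

if __name__ == "__main__":
    shards = ["big data is big", "map reduce maps data", "data at scale"]

    with Pool() as pool:
        partial_counts = pool.map(map_and_combine, shards)  # maps run in parallel

    # The final reduce merges the partial results; order does not matter because
    # addition of counts is associative (and commutative).
    totals = Counter()
    for partial in partial_counts:
        totals.update(partial)
    print(totals)
```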
While this process can often appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours.
The parallelism also offers some possibility of recovering from partial failure
of servers or storage during the operation: if one mapper or reducer fails, the
work can be rescheduled – assuming the input data is still available.
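Here is a toy sketch of that recovery path, assuming the input split is still available on storage; the failure below is simulated, and real frameworks detect lost workers on their own (for example through missed heartbeats).

```python
# If a mapper fails, the same input split is simply run again.
def make_flaky_mapper():
    attempts = {"count": 0}
    def mapper(split):
        attempts["count"] += 1
        if attempts["count"] == 1:            # simulate losing the first worker
            raise RuntimeError("worker lost")
        return [(word, 1) for word in split.split()]
    return mapper

def run_with_reschedule(task, split, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return task(split)
        except RuntimeError:
            # The input split still exists, so just reschedule the work.
            print(f"attempt {attempt} failed for {split!r}; rescheduling")
    raise RuntimeError(f"{split!r} failed {max_attempts} times")

mapper = make_flaky_mapper()
print(run_with_reschedule(mapper, "big data is big"))
```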
Another way to look at MapReduce is as a 5-step parallel and distributed computation:

1. Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the K1 input key value each processor would work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor would work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.
Logically these 5 steps can be
thought of as running in sequence – each step starts only after the previous
step is completed – though in practice, of course, they can be intertwined, as
long as the final result is not affected.
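To keep the two key spaces straight, here is the same 5-step flow written out as plain Python, with K1 taken to be a shard name and K2 a word; all names are illustrative rather than part of Hadoop's API.

```python
from collections import defaultdict

def user_map(k1, records):
    # Step 2: user-provided Map(), run once per K1 key, emitting (K2, value) pairs.
    for record in records:
        for word in record.split():
            yield word, 1                      # K2 = word

def user_reduce(k2, values):
    # Step 4: user-provided Reduce(), run once per K2 key.
    return k2, sum(values)

# Step 1: prepare the Map() input – here each K1 key is a shard name.
inputs = {"shard-0": ["big data is big"], "shard-1": ["map reduce maps data"]}

# Step 2: run Map() exactly once for each K1 key value.
map_output = []
for k1, records in inputs.items():
    map_output.extend(user_map(k1, records))

# Step 3: "shuffle" – group all Map output that shares the same K2 key.
shuffled = defaultdict(list)
for k2, value in map_output:
    shuffled[k2].append(value)

# Step 4: run Reduce() exactly once for each K2 key value.
reduce_output = [user_reduce(k2, values) for k2, values in shuffled.items()]

# Step 5: collect the Reduce output and sort it by K2 for the final result.
print(sorted(reduce_output))
```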
In many situations the input
data might already be distributed ("sharded") among many different
servers, in which case step 1 could sometimes be greatly simplified by
assigning Map servers that would process the locally present input data.
Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as local as possible to the Map-generated data they need to process.
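Here is a small sketch of that locality shortcut, with hypothetical shard locations: the scheduler simply prefers the least-loaded server that already holds a replica of each shard.

```python
shard_locations = {
    "shard-0": ["server-a", "server-c"],   # servers holding a replica of each shard
    "shard-1": ["server-b"],
    "shard-2": ["server-a"],
}
load = {"server-a": 0, "server-b": 0, "server-c": 0}   # tasks assigned so far

assignments = {}
for shard, holders in shard_locations.items():
    # Pick the least-loaded server that already has the data locally.
    local = min(holders, key=lambda server: load[server])
    assignments[shard] = local
    load[local] += 1

print(assignments)  # {'shard-0': 'server-a', 'shard-1': 'server-b', 'shard-2': 'server-a'}
```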
In this series I’ve given you an overview of Big Data, and I’ll be looking at the implications of Big Data for Capacity Management, part 2 of my Big Data series, on Wednesday 26 in a live webinar.
Register now to join me: http://www.metron-athene.com/services/training/webinars/index.html
If you missed Part 1, don’t worry – register for our Community and listen to it now: http://www.metron-athene.com/_downloads/on-demand-webinars/index.html
Dale Feiste
Principal Consultant