Must Hadoop finish all maps before reduce starts?
My lecturer at the university said that (Hadoop) reduce operations can only start when all map operations have finished.
This seems to contradict the output of a MapReduce streaming job, which sometimes clearly shows:
map 80% reduce 13%, map 80% reduce 27%, and then map 100% reduce 27%, ..., map 100% reduce 100%
(I have a three-node MapReduce cluster at home and I've run a few streaming jobs.)
What does the output mean, given that my lecturer knows what he is talking about? What state is the job in when reduce has started but map hasn't completed?
There are three steps in the Reduce phase:
1) copy (data to reducers)
2) sort (or more exactly merge)
3) reduce (execution of Reduce()).
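Reducers can start copying data from a Mapper as soon as that Mapper completes its execution. The sort and the call to Reduce() itself still have to wait until every Mapper has finished, so your lecturer is right: the reduce percentage you see while maps are still running is progress in the copy step, which accounts for roughly the first third of the reported reduce progress.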
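To read the progress lines from the question, here is a minimal Python sketch. It assumes the reported reduce percentage is split into equal thirds for copy, sort and reduce; the function name reduce_phase is made up for illustration and is not part of Hadoop.

```python
def reduce_phase(reduce_percent: float) -> str:
    """Guess which reduce step a reported reduce percentage falls in.

    Assumption: copy, sort and reduce each account for one third of
    the reported progress.
    """
    if not 0.0 <= reduce_percent <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if reduce_percent < 100.0 / 3:
        return "copy"
    if reduce_percent < 200.0 / 3:
        return "sort"
    return "reduce"


# The progress lines quoted in the question.
for map_pct, red_pct in [(80, 13), (80, 27), (100, 27), (100, 100)]:
    print(f"map {map_pct}% reduce {red_pct}% -> {reduce_phase(red_pct)}")
```

Under that assumption, every line that shows map below 100% has a reduce value under one third, which is consistent with the reducers only copying map output at that point.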
By default, schedulers wait until 5% of the map tasks in a job have completed before scheduling reduce tasks for the same job. For large jobs this can hurt cluster utilization, since the reducers that start early occupy reduce slots while they wait for the remaining map tasks to complete. Setting mapred.reduce.slowstart.completed.maps to a higher value, such as 0.80 (80%), can help improve throughput. (In newer Hadoop releases the property is named mapreduce.job.reduce.slowstart.completedmaps.)
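As a rough illustration of what the slowstart setting means for a given job, here is a small Python sketch. Both helper names are hypothetical, and rounding the threshold up is an assumption rather than something stated above. The second helper only builds the -D generic option that a streaming job could be launched with; the jar path and the remaining streaming options are left out on purpose.

```python
import math
from typing import List

SLOWSTART_PROPERTY = "mapred.reduce.slowstart.completed.maps"


def maps_before_reducers(total_maps: int, slowstart: float = 0.05) -> int:
    """Number of map tasks that must finish before reducers are scheduled.

    Assumption: the threshold is slowstart * total_maps, rounded up.
    """
    if total_maps < 0:
        raise ValueError("total_maps must not be negative")
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0 and 1")
    return math.ceil(slowstart * total_maps)


def slowstart_args(slowstart: float) -> List[str]:
    """Build the -D generic option for the slowstart property.

    Generic options such as -D go before the streaming-specific options.
    """
    return ["-D", f"{SLOWSTART_PROPERTY}={slowstart}"]


if __name__ == "__main__":
    print(maps_before_reducers(1000))        # default 5% threshold
    print(maps_before_reducers(1000, 0.80))  # the higher value suggested above
    print(" ".join(slowstart_args(0.80)))
```

With the higher setting, reducers are scheduled much later, so they hold their slots for less time while waiting on map output.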