What is the main purpose of the shuffle and sort phase in a Hadoop MapReduce job?
Think about what happens between the map and reduce steps.
The shuffle and sort phase collects all intermediate key-value pairs from mappers, groups them by key, and sorts them by key. This prepares the data so each reducer can process all values for a given key together.
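The grouping and sorting described above can be simulated in a few lines of Python. This is a conceptual sketch, not Hadoop code (the real shuffle is distributed across the network and implemented in Java), but it produces the same logical result that reducers see:

```python
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    """Group intermediate (key, value) pairs by key and sort the keys,
    mimicking the data layout Hadoop's shuffle and sort delivers to reducers."""
    groups = defaultdict(list)
    for key, value in mapper_outputs:
        groups[key].append(value)
    # Reducers receive keys in sorted order, each with its list of values.
    return sorted(groups.items())
```

For example, `shuffle_and_sort([("b", 1), ("a", 2), ("a", 3)])` returns `[("a", [2, 3]), ("b", [1])]`.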
Given the following intermediate mapper outputs, what will be the grouped and sorted data after the shuffle and sort phase?
Mapper outputs: (apple, 1), (banana, 1), (apple, 1), (banana, 1), (cherry, 1)
Shuffle groups by key and collects all values in a list.
After shuffle and sort, all values for each key are grouped into lists: (apple, [1, 1]), (banana, [1, 1]), (cherry, [1]). The keys are sorted alphabetically, so 'apple' comes before 'banana' and 'cherry'.
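Working through the example above in Python (again a local simulation, not Hadoop itself) confirms the grouped and sorted result:

```python
from collections import defaultdict

# The mapper outputs from the question above.
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("banana", 1), ("cherry", 1)]

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Sorting the grouped items by key yields the shuffle-and-sort output.
result = sorted(grouped.items())
# result == [("apple", [1, 1]), ("banana", [1, 1]), ("cherry", [1])]
```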
If a MapReduce job processes 1000 unique keys in the map phase, how many keys will be present in the shuffle and sort phase output?
Shuffle groups keys but does not create new keys or remove unique ones.
The shuffle and sort phase groups all intermediate data by the unique keys from the map output. The number of unique keys remains unchanged, so the shuffle and sort output still contains 1000 keys.
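A quick simulation illustrates the key-preservation property: even if each key is emitted many times and the pairs arrive in arbitrary order, grouping leaves exactly the original set of unique keys.

```python
import random
from collections import defaultdict

# Simulated map output: 1000 unique keys, each emitted three times.
pairs = [(f"key{i}", 1) for i in range(1000) for _ in range(3)]
random.shuffle(pairs)  # arrival order at the shuffle is arbitrary

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Shuffle and sort neither adds nor removes unique keys.
assert len(grouped) == 1000
```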
In a MapReduce job, the reducer receives unsorted keys and values. Which issue in the shuffle and sort phase could cause this?
Sorting happens during shuffle and sort phase.
Hadoop sorts intermediate keys by default during shuffle and sort. If keys arrive at reducers unsorted, the sort step of that phase failed: intermediate keys were not sorted before being merged and handed to the reducers.
You want to reduce network traffic during the shuffle and sort phase in a large MapReduce job. Which approach will help achieve this?
Think about reducing data volume before shuffle.
Using a combiner (set in Hadoop with job.setCombinerClass) reduces the amount of data sent over the network by partially aggregating each mapper's output locally before shuffle and sort.
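The savings can be seen in a small word-count simulation. This Python sketch applies reducer-style summation on the mapper side, the way a combiner would, and shows that fewer pairs need to cross the network:

```python
from collections import defaultdict

# One mapper's local output before any combining.
mapper_output = [("apple", 1), ("banana", 1), ("apple", 1), ("apple", 1)]

# Without a combiner, all four pairs would be shuffled over the network.
# A word-count combiner runs the same summation logic as the reducer,
# but locally on the mapper's output:
combined = defaultdict(int)
for key, value in mapper_output:
    combined[key] += value

shuffled_pairs = list(combined.items())
# Only two pairs remain to shuffle: ("apple", 3) and ("banana", 1).
```

Note that a combiner is only safe when the reduce function is associative and commutative (as summation is), since Hadoop may run it zero or more times.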