
MapReduce job execution flow in Hadoop - Step-by-Step Execution

Choose your learning style9 modes available
Concept Flow - MapReduce job execution flow
Submit Job
Split Input Data
Map Tasks Start
Map Tasks Produce (key,value)
Shuffle and Sort
Reduce Tasks Start
Reduce Tasks Aggregate Results
Write Output
Job Complete
The flow starts with job submission, then input data is split and processed by map tasks. Map outputs are shuffled and sorted, then reduce tasks aggregate results and write output, ending the job.
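The split-and-map portion of this flow can be sketched in plain Python (a minimal illustration, not Hadoop itself; the names `split_input` and `map_fn` are illustrative, and word count stands in for an arbitrary map function):

```python
def split_input(text, num_splits):
    """Step 2: divide the input lines into roughly equal splits."""
    lines = text.splitlines()
    size = max(1, len(lines) // num_splits)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_fn(line):
    """Step 3: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

splits = split_input("hadoop runs jobs\nhadoop splits data", 2)
map_output = [pair for split in splits
              for line in split
              for pair in map_fn(line)]
print(map_output)
# [('hadoop', 1), ('runs', 1), ('jobs', 1), ('hadoop', 1), ('splits', 1), ('data', 1)]
```

In real Hadoop each split would go to a separate map task on a different node; here the splits are simply processed in sequence.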
Execution Sample
1. Submit job
2. Split input
3. Run map tasks
4. Shuffle and sort
5. Run reduce tasks
6. Write output
This sequence shows the main steps of a MapReduce job from start to finish.
Execution Table
Step | Action | Input | Output | Notes
1 | Submit Job | User program | Job configuration | Job is sent to the Hadoop cluster
2 | Split Input Data | Large input file | Input splits | Data divided for parallel processing
3 | Run Map Tasks | Input splits | (key,value) pairs | Map function processes each split
4 | Shuffle and Sort | (key,value) pairs | Sorted (key, list(values)) | Groups values by key across all maps
5 | Run Reduce Tasks | Sorted (key, list(values)) | Aggregated results | Reduce function processes grouped data
6 | Write Output | Aggregated results | Output files | Results saved to distributed storage
7 | Job Complete | Output files | Success status | Job finishes successfully
💡 Job completes after output is written and success status is returned
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final
Input Data | Large file | Split into chunks | Chunks processed by map | Mapped pairs ready | Reduced results ready | Output files
Map Output | N/A | N/A | (key,value) pairs | Grouped by key | Aggregated results | Saved output
Reduce Output | N/A | N/A | N/A | N/A | Aggregated results | Saved output
Key Moments - 3 Insights
Why does the input data get split before mapping?
Splitting the input allows multiple map tasks to process it in parallel, speeding up the job, as shown in step 2 of the execution table.
What happens during the shuffle and sort phase?
Shuffle and sort groups all map outputs by key so reduce tasks can process all values for each key together, as seen in step 4.
When is the job considered complete?
After reduce tasks write the output files and the job returns a success status, shown in steps 6 and 7.
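The shuffle-and-sort insight above is easy to demonstrate: group every map output value under its key, then order the keys. A minimal Python sketch (the function name `shuffle_and_sort` is illustrative; Hadoop performs this across the network between map and reduce nodes):

```python
from collections import defaultdict

def shuffle_and_sort(pairs):
    """Group all values by key and sort the keys -- the contract
    the reduce phase relies on (step 4 in the execution table)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

grouped = shuffle_and_sort([("hadoop", 1), ("data", 1), ("hadoop", 1)])
print(grouped)
# [('data', [1]), ('hadoop', [1, 1])]
```

Each reducer can now process one key with its complete list of values, which is exactly why skipping this phase would leave reducers with incomplete keys.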
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the output after step 3 (Run Map Tasks)?
A. Input splits
B. Aggregated results
C. (key,value) pairs
D. Output files
💡 Hint
Check the 'Output' column for step 3 in the execution table.
At which step does the job split the input data for parallel processing?
A. Step 2
B. Step 1
C. Step 4
D. Step 5
💡 Hint
Look at the 'Action' column and find where input data is divided.
If the shuffle and sort phase was skipped, what would likely happen?
A. Map tasks would fail to run
B. Reduce tasks would receive unsorted data and might miss some keys
C. Input data would not be split
D. Output files would be written earlier
💡 Hint
Refer to the purpose of shuffle and sort in grouping keys before reduce tasks.
Concept Snapshot
MapReduce job flow:
1. Submit job
2. Split input data
3. Map tasks process splits
4. Shuffle and sort map outputs
5. Reduce tasks aggregate data
6. Write output files
Splitting enables parallelism; shuffle groups keys for reduce.
Full Transcript
A MapReduce job starts when a user submits it to the Hadoop cluster. The input data is split into smaller chunks so multiple map tasks can run in parallel. Each map task processes its input split and produces key-value pairs. These pairs are shuffled and sorted to group all values by their keys. Reduce tasks then process each key and its list of values to aggregate results. Finally, the reduce outputs are written to output files in distributed storage. The job completes successfully after writing output. This flow enables processing large data efficiently by dividing work and combining results.
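The whole transcript can be condensed into one end-to-end word-count job, the canonical MapReduce example. This is a single-process sketch, not the distributed Hadoop runtime; `run_job` is an illustrative name, and "writing output" is represented by returning the final dictionary:

```python
from collections import defaultdict

def run_job(text):
    # Map: emit a (word, 1) pair for every word in the input splits.
    pairs = [(w, 1) for line in text.splitlines() for w in line.split()]
    # Shuffle and sort: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: aggregate each key's values, then "write" the output.
    return {key: sum(values) for key, values in sorted(groups.items())}

result = run_job("hadoop map reduce\nhadoop shuffle")
print(result)
# {'hadoop': 2, 'map': 1, 'reduce': 1, 'shuffle': 1}
```

Every step of the execution table appears here in miniature: the map, shuffle, and reduce phases are the three blocks of the function, and the returned dictionary plays the role of the output files.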