How MapReduce Parallelizes Data Processing in Hadoop - A Performance Analysis
We want to understand how MapReduce speeds up the processing of large datasets by splitting the work into pieces. How does breaking the input into parts affect the total time to finish?
Analyze the time complexity of the following MapReduce job setup.
```java
// Configuring a Hadoop MapReduce job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "example job");
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setNumReduceTasks(p); // p reducer tasks aggregate the mappers' output
// The framework splits the input into parts (roughly one mapper per
// HDFS block); each mapper processes its split in parallel, and the
// reducers then aggregate the mapper outputs.
```
This code sets up a MapReduce job that splits input data into parts processed in parallel by mappers, then combines results with reducers.
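Outside of Hadoop, the same split-map-reduce pattern can be sketched with an ordinary thread pool. This is a minimal, self-contained simulation, not Hadoop code: the class name, the character-count "map" step, and the choice of p threads are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MapPhaseSketch {
    // Split the input into p roughly equal parts, "map" each part in
    // parallel (here: count its characters), then "reduce" by summing.
    static long process(String input, int p) throws Exception {
        int chunk = (input.length() + p - 1) / p; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(p);
        List<Future<Long>> partials = new ArrayList<>();
        for (int i = 0; i < input.length(); i += chunk) {
            String split = input.substring(i, Math.min(i + chunk, input.length()));
            partials.add(pool.submit(() -> (long) split.length())); // mapper
        }
        long total = 0;
        for (Future<Long> f : partials) total += f.get(); // reducer
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process("hello mapreduce world", 4)); // prints 21
    }
}
```

Each substring plays the role of an input split; the Future results are the mapper outputs, which the final loop "reduces" by summing.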
Look at the repeated work done by mappers and reducers.
- Primary operation: Processing each data split in a mapper.
- How many times: Once per data split, all running at the same time.
As input data grows, it is split into more parts processed in parallel.
| Input Size (n) | Splits (assuming ~1 GB per split) |
|---|---|
| 10 GB | 10 splits processed in parallel |
| 100 GB | 100 splits processed in parallel |
| 1000 GB | 1000 splits processed in parallel |
Pattern observation: more data means more splits, but the time per split stays roughly constant because the splits are processed concurrently (assuming enough workers are available).
Time Complexity: O(n / p)
This means the total work grows linearly with the data size n, but spreading that work across p parallel tasks divides the wall-clock time by p.
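A quick back-of-the-envelope estimate makes this concrete (assuming equal-sized splits, one unit of work per unit of data, and no coordination overhead; the class and method names here are hypothetical):

```java
public class ParallelTime {
    // Wall-clock estimate under the O(n / p) model: total work divided
    // by the number of parallel tasks.
    static double estimate(double n, int p) {
        return n / p;
    }

    public static void main(String[] args) {
        // Same data, more workers: time shrinks proportionally.
        System.out.println(estimate(1000, 10));  // prints 100.0
        System.out.println(estimate(1000, 100)); // prints 10.0
    }
}
```

In practice, shuffle and startup overhead keep real jobs from scaling perfectly, so treat n / p as a lower bound on running time.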
[X] Wrong: "MapReduce always runs in constant time no matter how big the data is."
[OK] Correct: Even though tasks run in parallel, the total work still grows with the data size, so the finishing time depends on both n and the number of parallel tasks p.
Understanding how parallelism splits work helps you explain why big data tools like MapReduce can handle huge datasets efficiently.
"What if the number of parallel tasks p is fixed and data size n keeps growing? How does that affect time complexity?"
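One way to answer the question above: hold p constant and grow n. In the O(n / p) model the time then grows linearly with n, so O(n / p) reduces to O(n) when p is fixed. A small sketch under the same simplifying assumptions as before (names are illustrative):

```java
public class FixedClusterSize {
    // Time under the O(n / p) model, with one unit of work per unit of data.
    static double time(double n, int p) {
        return n / p;
    }

    public static void main(String[] args) {
        int p = 10; // cluster size is fixed
        for (double n : new double[]{100, 200, 400}) {
            System.out.println("n=" + n + " -> time=" + time(n, p));
        }
        // Doubling n doubles the time: with fixed p, growth is O(n).
    }
}
```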