MapReduce job execution flow in Hadoop - Time Complexity
We want to understand how a MapReduce job's execution time changes as the input data grows. Analyze the time complexity of the following MapReduce job execution flow.
```java
// Simplified MapReduce job flow (driver code)
Job job = Job.getInstance(conf, "word count"); // the Job(conf, name) constructor is deprecated
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);        // word
job.setOutputValueClass(IntWritable.class); // count
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);
```
This code sets up and runs a MapReduce job that counts words in input files.
- Primary operation: the Mapper processes each input record once.
- How many times: once per input record, so n times for n records.
- Secondary operation: the Reducer processes each unique key together with its list of values.
- How many times: once per unique key; for word count, once per distinct word in the input.
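The two operations above can be sketched as a plain-Java stand-in for the job (a local simulation, not the Hadoop API): the map phase emits one `(word, 1)` pair per token, and the reduce phase sums the values for each unique key.

```java
import java.util.HashMap;
import java.util.Map;

// Local sketch of the word-count flow: map once per record,
// then sum per unique key (the reduce step).
public class WordCountSketch {
    public static Map<String, Integer> run(String[] records) {
        Map<String, Integer> counts = new HashMap<>();
        for (String record : records) {                  // map: once per input record
            for (String word : record.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);     // reduce: sum values per unique key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] records = {"the quick brown fox", "the lazy dog"};
        System.out.println(WordCountSketch.run(records)); // "the" maps to 2
    }
}
```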
As the input grows, the number of map operations grows with it: one per record.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 map operations |
| 100 | About 100 map operations |
| 1000 | About 1000 map operations |
Pattern observation: The number of map operations grows directly with input size.
Time Complexity: O(n)

This means the job's running time grows roughly in direct proportion to the amount of input data, where n is the number of input records.
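The pattern in the table can be checked directly by counting simulated map operations for growing inputs (an illustrative sketch; one "operation" per record, as above):

```java
// Counts simulated map operations to show they grow
// in direct proportion to the number of input records.
public class LinearGrowthDemo {
    public static long mapOperations(int numRecords) {
        long ops = 0;
        for (int i = 0; i < numRecords; i++) {
            ops++; // each record is mapped exactly once
        }
        return ops;
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000}) {
            System.out.println(n + " records -> " + LinearGrowthDemo.mapOperations(n) + " map operations");
        }
    }
}
```

Doubling the input doubles the operation count, which is exactly what O(n) predicts.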
[X] Wrong: "The Reducer runs once regardless of input size."
[OK] Correct: The Reducer runs once per unique key, so if growing input produces more unique keys, the Reducer's work grows as well.
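To see why the wrong claim fails, compare two inputs of the same size with different numbers of unique keys; the reducer is invoked once per key, so the invocation counts differ (local sketch mirroring how the shuffle groups keys):

```java
import java.util.HashSet;
import java.util.Set;

// Counts reducer invocations (one per unique key) for a given
// set of mapped words. Local sketch, not the Hadoop API.
public class ReducerInvocations {
    public static int uniqueKeys(String[] words) {
        Set<String> keys = new HashSet<>();
        for (String w : words) keys.add(w); // shuffle groups values by key
        return keys.size();                 // reduce() runs once per key
    }

    public static void main(String[] args) {
        String[] fewKeys  = {"a", "a", "a", "a"}; // 4 records, 1 unique key
        String[] manyKeys = {"a", "b", "c", "d"}; // 4 records, 4 unique keys
        System.out.println(ReducerInvocations.uniqueKeys(fewKeys));  // 1 reducer call
        System.out.println(ReducerInvocations.uniqueKeys(manyKeys)); // 4 reducer calls
    }
}
```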
Understanding how MapReduce scales with data size helps you explain how big data jobs handle growth efficiently.
"What if we added a Combiner step before the Reducer? How would the time complexity change?"
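One way to explore that question locally: a combiner pre-aggregates each map task's output before the shuffle (in Hadoop, via `job.setCombinerClass(...)`). The overall complexity stays O(n), since every record is still mapped once, but the number of pairs shuffled to reducers can drop sharply. The sketch below simulates the effect; the names and input are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Simulates a combiner: a single map task pre-sums its own (word, 1)
// pairs, so the shuffle carries at most one pair per unique word.
// Total work is still O(n); only the shuffled volume shrinks.
public class CombinerDemo {
    // Returns the number of pairs this task shuffles after combining.
    public static int shuffledPairs(String[] split) {
        Map<String, Integer> partial = new HashMap<>();
        for (String record : split) {
            for (String word : record.split("\\s+")) {
                partial.merge(word, 1, Integer::sum); // combine locally
            }
        }
        return partial.size(); // one pair per unique word
    }

    public static void main(String[] args) {
        String[] split = {"to be or not to be", "to be is to do"};
        // Without a combiner, all 11 (word, 1) pairs would be shuffled;
        // with one, only a pair per unique word is.
        System.out.println(CombinerDemo.shuffledPairs(split));
    }
}
```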