Hadoop distributions (Cloudera, Hortonworks) - Time & Space Complexity
When working with Hadoop distributions such as Cloudera and Hortonworks, it's important to understand how processing time grows as the data size increases.
The question we want to answer: how does job execution time change as we feed these platforms more data?
Analyze the time complexity of the following Hadoop MapReduce job setup.
```java
// Driver for a Hadoop word count job (Hadoop imports omitted for brevity)
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);           // locate the job jar on the cluster
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);            // word
job.setOutputValueClass(IntWritable.class);   // count
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
```
This code sets up a simple word count job that reads input data, maps words, and reduces counts.
Look at the parts that repeat as data grows.
- Primary operation: The Mapper processes each input record (line or chunk of text).
- How many times: Once for every input record, so as input size grows, this runs more times.
- Reducer operation: Combines mapped data by key. It runs once per unique key, and the work per call depends on how many values are grouped under that key (the data distribution).
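To make the call counts concrete, here is a minimal in-memory sketch of the word count flow in plain Java (not the Hadoop API; the input records and helper names are illustrative). The map loop runs once per record, while the reduce loop runs once per unique word:

```java
import java.util.*;

public class WordCountSim {
    // Map phase: emits one (word, 1) pair per word; the outer loop runs
    // once per input record -> O(n) in the number of records.
    static Map<String, List<Integer>> mapPhase(List<String> records) {
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (String record : records) {
            for (String word : record.split("\\s+")) {
                shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return shuffled;
    }

    // Reduce phase: invoked once per unique key, summing its grouped values.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> shuffled) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "the quick brown fox",
            "the lazy dog",
            "the quick dog");
        Map<String, List<Integer>> shuffled = mapPhase(records);
        Map<String, Integer> counts = reducePhase(shuffled);
        // 3 records drive 3 map calls, but only 6 unique words drive reduce calls.
        System.out.println("unique keys (reduce calls): " + shuffled.size());
        System.out.println("count of 'the': " + counts.get("the"));
    }
}
```

A real Hadoop job adds partitioning, sorting, and disk/network I/O on top of this, but the proportions are the same: map work scales with records, reduce work with unique keys.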
As the input data size increases, the number of map tasks grows roughly in proportion.
| Input Size (n) | Approx. Map Tasks |
|---|---|
| 10 MB | ~1 |
| 100 MB | ~10 |
| 1000 MB | ~100 |

(Illustrative figures assuming ~10 MB input splits; with the HDFS default of 128 MB blocks the absolute task counts differ, but the proportional growth is the same.)
Pattern observation: The execution time grows roughly linearly with input size because each record is processed once in the map phase.
Time Complexity: O(n)
This means the time to complete the job grows roughly in direct proportion to the amount of input data.
[X] Wrong: "The reducer time grows the same as the input size."
[OK] Correct: Reducers work on grouped keys, so their time depends on the number of unique keys, not total input size.
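The distinction can be demonstrated directly. In this hypothetical sketch, two inputs have the same total size (8 records each) but very different key cardinality, so the simulated reducer is invoked a different number of times:

```java
import java.util.*;

public class ReducerScaling {
    // Simulated reduce phase: the framework calls the reducer once per unique key.
    static int reduceCalls(List<String> keys) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String k : keys) grouped.merge(k, 1, Integer::sum);
        return grouped.size(); // one reducer invocation per unique key
    }

    public static void main(String[] args) {
        // Same input size (8 records each)...
        List<String> fewKeys  = Arrays.asList("a","a","a","a","b","b","b","b");
        List<String> manyKeys = Arrays.asList("a","b","c","d","e","f","g","h");

        // ...but very different reducer workloads.
        System.out.println(reduceCalls(fewKeys));  // 2
        System.out.println(reduceCalls(manyKeys)); // 8
    }
}
```

This is why a skewed key distribution (a few "hot" keys holding most of the values) can leave one reducer doing most of the work even when total input size is unchanged.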
Understanding how Hadoop jobs scale with data size lets you predict performance and design efficient data pipelines.
"What if we added a combiner step before the reducer? How would that affect the time complexity?"
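One way to reason about that question: a combiner runs a local reduce on each map task's output before the shuffle. It does not change the O(n) map phase, but it can sharply cut the number of records shuffled to reducers. The sketch below (hypothetical split contents, plain Java rather than the Hadoop `Combiner` API) pre-sums counts per split:

```java
import java.util.*;

public class CombinerSketch {
    // Local reduce: merge values for each key within one map task's output.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> out = new HashMap<>();
        for (String w : words) out.merge(w, 1, Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical map outputs from two separate input splits.
        List<String> split1 = Arrays.asList("the", "the", "the", "dog");
        List<String> split2 = Arrays.asList("the", "cat", "cat", "the");

        // Without a combiner, every (word, 1) pair is shuffled: 8 records.
        int shuffledWithout = split1.size() + split2.size();

        // With a combiner, counts are pre-summed per split before the shuffle.
        int shuffledWith = combine(split1).size() + combine(split2).size();

        System.out.println("shuffled without combiner: " + shuffledWithout); // 8
        System.out.println("shuffled with combiner: " + shuffledWith);       // 4
    }
}
```

So the asymptotic complexity stays O(n), but the constant factor on the shuffle (typically the most expensive phase) drops, often substantially for inputs with many repeated keys.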