The Reduce Phase in Hadoop Explained - Time Complexity
We want to understand how the running time of the Reduce phase changes as the input data grows: how does it handle more data, and how does that affect its speed?
Analyze the time complexity of the following Reduce function.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // One pass over every value for this key.
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
This code sums all values for each key and writes the total.
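The Hadoop types are not needed to see the cost pattern. Here is a plain-Java sketch of the same summing loop; the class and method names are made up for illustration, not part of Hadoop:

```java
import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    // Same shape as the reducer body: one pass over the values for a key.
    static int sum(List<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // e.g. the word-count values emitted for one key
        System.out.println(sum(Arrays.asList(1, 1, 1, 1, 1))); // prints 5
    }
}
```

Each value is visited exactly once, which is the observation the rest of the analysis builds on.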
Look at what repeats inside the Reduce function.
- Primary operation: Looping over all values for one key to sum them.
- How many times: Once per key, looping through all values linked to that key.
As the number of values per key grows, the time to sum them grows too.
| Input Size (values per key) | Approx. Operations (sum steps) |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
Pattern observation: The time grows directly with the number of values to sum.
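The table above can be checked with a small sketch that counts one "sum step" per value; the countOps helper is hypothetical, used only to make the operation count visible:

```java
public class LinearGrowth {
    // Count how many additions the reduce loop performs for n values.
    static long countOps(int n) {
        long ops = 0;
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += 1; // one sum step per value
            ops++;
        }
        return ops;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(n + " values -> " + countOps(n) + " sum steps");
        }
    }
}
```

Doubling the number of values doubles the count of sum steps, which is exactly the straight-line growth the table shows.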
Time Complexity: O(n), where n is the number of values for a key.
This means the time to reduce grows in a straight line with the number of values for each key.
[X] Wrong: "The Reduce phase always takes the same time no matter how many values there are."
[OK] Correct: The Reduce phase must look at every value for a key, so more values mean more work and more time.
Understanding how the Reduce phase scales helps you explain how big data jobs handle large inputs efficiently.
"What if the Reduce function had to sort the values before summing? How would the time complexity change?"
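One way to explore that question: if the reducer copied and sorted the values before summing, the sort's O(n log n) cost would dominate the O(n) sum, making the whole function O(n log n). A sketch using plain Java collections rather than Hadoop types (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortThenSum {
    // Copy, sort (O(n log n)), then sum (O(n)): total O(n log n).
    static int sortedSum(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted); // dominates the running time
        int sum = 0;
        for (int val : sorted) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sortedSum(List.of(3, 1, 2))); // prints 6
    }
}
```

The sum itself is unchanged; only the extra sorting step raises the complexity class.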