Word Count as a MapReduce Example in Hadoop - Time & Space Complexity
We want to understand how the time needed to count words grows as the input text gets bigger.
How does the number of words affect the work done by MapReduce?
Analyze the time complexity of the following code snippet.
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WordCount {

    public static class Mapper
            extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // One pass over the input split: emit (word, 1) for every word.
            String[] words = value.toString().split("\\s+");
            for (String word : words) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    public static class Reducer
            extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // One pass over this key's counts: sum them into the final total.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```
This code counts how many times each word appears in the input text using MapReduce.
Identify the repeated work: the loops, recursion, and array traversals.
- Primary operation: Loop over each word in the input text in the Mapper, and loop over counts for each word in the Reducer.
- How many times: The Mapper loops once per word in the input; the Reducer loops once per unique word's counts.
As the number of words grows, the Mapper processes each word once, so work grows directly with input size.
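The two phases can be sketched in plain Java without a Hadoop cluster. This is a minimal simulation for reasoning about operation counts; `WordCountSketch` and its `map`/`reduce` helpers are illustrative names, not part of the Hadoop API:

```java
import java.util.*;

public class WordCountSketch {

    // Map phase: emit one (word, 1) pair per word -> exactly n operations for n words.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum each group's counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "the" appears twice, so its summed count is 2.
        System.out.println(reduce(map("the cat sat on the mat")));
    }
}
```

Running the map step over n words always produces n pairs, which is the linear cost analyzed below.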
| Input Size (n words) | Approx. Operations |
|---|---|
| 10 | About 10 map operations + reduce on unique words |
| 100 | About 100 map operations + reduce on unique words |
| 1000 | About 1000 map operations + reduce on unique words |
Pattern observation: The total work grows roughly in direct proportion to the number of words.
Time Complexity: O(n), where n is the total number of words in the input.
This means the time to count words grows linearly with the input: doubling the number of words roughly doubles the work.
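The linear pattern in the table can be checked directly by counting emitted pairs for growing inputs. This is a standalone sketch; `LinearGrowthCheck` and `mapOperations` are hypothetical names chosen for illustration:

```java
import java.util.*;

public class LinearGrowthCheck {

    // Count map-side operations: one context.write per word in the real Mapper.
    static long mapOperations(String text) {
        long ops = 0;
        for (String word : text.split("\\s+")) {
            ops++;
        }
        return ops;
    }

    public static void main(String[] args) {
        // Build inputs of 10, 100, and 1000 words and count the map operations.
        for (int n : new int[] {10, 100, 1000}) {
            String input = String.join(" ", Collections.nCopies(n, "word"));
            System.out.println(n + " words -> " + mapOperations(input) + " map operations");
        }
    }
}
```

The operation count matches n exactly at every size, which is what O(n) predicts.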
[X] Wrong: "The reduce step takes as much time as the map step because it loops over all words again."
[OK] Correct: Each reduce call loops only over the counts for a single unique word, and the number of unique words is usually much smaller than the total word count.
Understanding how MapReduce scales with input size helps you explain distributed processing clearly and confidently.
"What if the input text had many repeated words? How would that affect the time complexity of the reduce step?"