Map phase explained in Hadoop - Time & Space Complexity
We want to understand how the time needed for the Map phase changes as the input data grows.
How does the Map phase handle more data and how does that affect processing time?
Analyze the time complexity of the following code snippet.
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();    // the raw input line
        String[] words = line.split(" ");  // split the line on spaces
        for (String word : words) {        // one iteration per word
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
```
This code reads each line of input, splits it into words, and outputs each word with a count of one.
Identify the loops, recursion, or array traversals that repeat as the input grows.
- Primary operation: Loop over all words in each input line.
- How many times: Once per word in the input data.
As the input data grows, the number of words grows roughly in proportion.
| Input Size (n words) | Approx. Operations |
|---|---|
| 10 | 10 loops over words |
| 100 | 100 loops over words |
| 1000 | 1000 loops over words |
Pattern observation: The work grows directly with the number of words; doubling the number of words doubles the work.
Time Complexity: O(n)
This means the Map phase time grows linearly with the number of words processed.
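The linear pattern in the table above can be checked with a minimal plain-Java sketch (outside Hadoop, so it runs standalone). The `countEmissions` helper is hypothetical: it stands in for the Mapper's per-word loop, counting how many times `context.write` would be called for a given line.

```java
import java.util.Collections;

public class MapWorkDemo {
    // Stand-in for the Mapper's loop: counts one "emission" per word,
    // just as the real code calls context.write once per word.
    static long countEmissions(String line) {
        long ops = 0;
        for (String word : line.split(" ")) {
            ops++; // one write per word
        }
        return ops;
    }

    public static void main(String[] args) {
        // Build inputs of 10, 100, and 1000 words and count the work done.
        for (int n : new int[]{10, 100, 1000}) {
            String line = String.join(" ", Collections.nCopies(n, "word"));
            System.out.println(n + " words -> " + countEmissions(line) + " operations");
        }
    }
}
```

Running this prints one line of work per input size, and the operation count matches the word count exactly: the signature of O(n) growth.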
[X] Wrong: "The Map phase time stays the same no matter how much data there is."
[OK] Correct: The Map phase processes each word once, so more words mean more work and more time.
Understanding how the Map phase scales helps you explain how big data jobs handle large inputs efficiently.
"What if the map function also had to check each word against a dictionary? How would the time complexity change?"
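One way to reason about that question: if the dictionary is stored in a `HashSet`, each lookup is O(1) on average, so the overall complexity stays O(n). The sketch below is a hypothetical illustration (the `DICTIONARY` contents and `countValidWords` helper are invented for this example, not part of the original code).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DictionaryMapDemo {
    // Hypothetical dictionary; a HashSet gives O(1) average-time lookups.
    static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("hello", "world"));

    // Same per-word loop as the Mapper, plus one constant-time
    // dictionary check per word: total work remains O(n).
    static long countValidWords(String line) {
        long valid = 0;
        for (String word : line.split(" ")) {
            if (DICTIONARY.contains(word)) { // O(1) average per word
                valid++;
            }
        }
        return valid;
    }
}
```

If the dictionary were instead a sorted list searched with binary search, each lookup would cost O(log d) for a dictionary of d entries, and the Map phase would become O(n log d).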