Migration from Hadoop to cloud-native - Time & Space Complexity
When migrating data processing from Hadoop to cloud-native systems, it is important to understand how the time to complete tasks changes. In particular, we want to know how the amount of work grows as the data size grows.
Analyze the time complexity of the following Hadoop MapReduce job.
```java
// Hadoop MapReduce example: count word occurrences
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every word in the input line
            for (String word : value.toString().split(" ")) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all of the 1s emitted for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```
This code counts how many times each word appears in a large text dataset using Hadoop MapReduce.
Identify the loops that perform repeated work:
- Primary operation: The Mapper loops over every word in each input line.
- How many times: Once for every word in the input data.
- Secondary operation: The Reducer loops over all counts for each word to sum them.
- How many times: Once for each occurrence of each word.
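The per-word work described in these bullets can be sketched without a Hadoop cluster. The following plain-Java stand-in (the class `WordCountSim` and its operation counter are hypothetical, not part of the Hadoop job) performs one counted operation per word processed, which is what makes the total work proportional to the number of words:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSim {
    // Counts words and tallies how many per-word operations were performed.
    // (Hypothetical stand-in for the map + reduce passes above.)
    public static long countOperations(String[] lines) {
        long ops = 0;
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                ops++;                                // one map-side emit per word
                counts.merge(word, 1, Integer::sum);  // one reduce-side add per occurrence
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        String[] small = {"a b c", "a b"};
        System.out.println(countOperations(small)); // 5 words -> 5 operations
    }
}
```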
As the input data size grows, the number of words grows roughly in proportion.
| Input Size (n words) | Approx. Operations |
|---|---|
| 10 | About 10 word processing steps |
| 100 | About 100 word processing steps |
| 1000 | About 1000 word processing steps |
Pattern observation: The work grows directly with the number of words; doubling words doubles work.
Time Complexity: O(n)
This means processing time grows linearly with the amount of data.
[X] Wrong: "Migrating to cloud-native will always make processing faster regardless of data size."
[OK] Correct: The time still depends on data size; cloud tools may be faster per unit but the overall growth with data remains linear.
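The "faster per unit, same growth" point can be illustrated with a toy cost model (the class name and per-word costs below are assumed purely for illustration): a faster platform shrinks the constant factor, but the total cost still scales linearly with the number of words, so doubling the data doubles the cost on either system.

```java
public class ConstantFactorDemo {
    // Total cost is (number of words) x (cost per word): linear in nWords
    // no matter how small the per-word cost is.
    static double totalCost(long nWords, double costPerWord) {
        return nWords * costPerWord;
    }

    public static void main(String[] args) {
        // Hypothetical per-word costs: slower on-prem vs. faster cloud-native
        System.out.println(totalCost(1_000_000, 2.0)); // 2000000.0
        System.out.println(totalCost(1_000_000, 0.5)); // 500000.0
        // Doubling n doubles the cost in both systems: still O(n)
        System.out.println(totalCost(2_000_000, 0.5)); // 1000000.0
    }
}
```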
Understanding how data size affects processing time helps you explain migration benefits clearly and shows you grasp core data engineering concepts.
"What if we changed the Reducer to combine counts before sending data? How would the time complexity change?"