Hadoop ecosystem overview - Time & Space Complexity
We want to understand how the time to process data in the Hadoop ecosystem grows as the input size increases.
How does Hadoop handle bigger data and what parts take more time?
Analyze the time complexity of a simple MapReduce job in Hadoop.
```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper class: processes each line of input and emits (word, 1) pairs
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reducer class: sums the values for each key and emits the final result
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
This code processes input data line by line, maps it to key-value pairs, then reduces by summing values per key.
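The same two-phase flow can be sketched without a cluster. Below is a minimal plain-Java simulation of the map and reduce steps; the class `WordCountSim` and the helpers `mapPhase` and `reducePhase` are illustrative names, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSim {
    // Map phase: emit a (word, 1) pair for every word on every line
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the emitted values for each key
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hadoop maps data", "hadoop reduces data");
        Map<String, Integer> counts = reducePhase(mapPhase(lines));
        System.out.println(counts); // {hadoop=2, maps=1, data=2, reduces=1}
    }
}
```

Both phases touch each emitted pair a constant number of times, which is what the complexity analysis below formalizes.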
Look at the loops and repeated steps in the MapReduce job.
- Primary operation: The mapper processes each input line once.
- How many times: Once per input line, so as many times as there are lines (n).
- Secondary operation: The reducer processes each unique key and sums its values.
- How many times: Once per unique key (call it k), which depends on the data but is usually far fewer than n.
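To make the two counts above concrete, this small sketch tallies how often each phase would run on a sample input (the class `OperationCounter` and the method `countCalls` are hypothetical helpers for illustration, not Hadoop instrumentation):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OperationCounter {
    // Returns {mapCalls, reduceCalls} for a list of input lines:
    // the mapper runs once per line, the reducer once per unique key.
    static int[] countCalls(List<String> lines) {
        int mapCalls = 0;
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            mapCalls++;                              // one map call per line
            for (String word : line.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        int reduceCalls = counts.size();             // one reduce call per unique key
        return new int[] {mapCalls, reduceCalls};
    }

    public static void main(String[] args) {
        int[] calls = countCalls(List.of("a b", "a b", "b a", "a a"));
        System.out.println("map calls: " + calls[0]);    // 4 lines -> 4 map calls
        System.out.println("reduce calls: " + calls[1]); // 2 unique keys -> 2 reduce calls
    }
}
```

With 4 input lines but only 2 distinct words, the mapper is invoked 4 times while the reducer is invoked only twice, matching the k ≤ n observation above.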
As the input data grows, the mapper runs more times, once per line.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 map operations |
| 100 | About 100 map operations |
| 1000 | About 1000 map operations |
Pattern observation: The number of operations grows roughly in direct proportion to the input size.
Time Complexity: O(n)
This means the time to run the job grows linearly with the amount of input data.
[X] Wrong: "The reducer runs as many times as the input lines."
[OK] Correct: The reducer runs once per unique key (k), which is usually far fewer than the number of input lines. Note that the reducers together still iterate over all n emitted values, but the reducer is *invoked* fewer times than the mapper.
Understanding how Hadoop scales with data size helps you explain real-world data processing and shows you can think about performance clearly.
"What if the reducer had to sort all keys before processing? How would the time complexity change?"