Data lake design patterns in Hadoop - Time & Space Complexity
When working with data lakes, it is important to understand how processing time grows as the data size increases. Specifically, we want to know how the chosen design pattern affects the speed of data processing in Hadoop.
Analyze the time complexity of the following Hadoop MapReduce job using a data lake design pattern.
```
// Mapper reads data from the raw zone
map(key, value) {
    // parse and filter records, keeping only valid ones
    if (record is valid) {
        emit(key, value);
    }
}

// Reducer aggregates the filtered data
reduce(key, values) {
    // sum every value associated with this key
    sum = 0;
    for each v in values {
        sum += v;
    }
    emit(key, sum);
}
```
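The pseudocode above can be sketched as a small single-machine simulation (a hedged illustration in Python, not actual Hadoop code; the grouping function stands in for Hadoop's shuffle/sort phase, and `value is not None` stands in for the "record is valid" check):

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: parse and filter; emit (key, value) only for valid records."""
    for key, value in records:
        if value is not None:  # stand-in for "record is valid"
            yield key, value

def shuffle(pairs):
    """Group values by key, as Hadoop's shuffle/sort phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum all values for each key."""
    return {key: sum(values) for key, values in groups.items()}

raw = [("a", 1), ("b", 2), ("a", None), ("a", 3)]
result = reduce_phase(shuffle(map_phase(raw)))
print(result)  # {'a': 4, 'b': 2} -- the invalid ("a", None) record was filtered out
```

Each record flows through the mapper exactly once, and each surviving value is added exactly once in the reducer, which is the behavior the complexity analysis below relies on.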
This code reads raw data, filters it, and then aggregates results in the reduce step.
Look at the loops and repeated steps in the code.
- Primary operations: The mapper runs once per input record, and the reducer loops over all values for each key to sum them.
- How many times: The reducer runs once per unique key, processing every value associated with that key.
As the input grows, the number of records to map and the number of values to sum grow with it.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 map operations and sums over small groups |
| 100 | About 100 map operations and sums over larger groups |
| 1000 | About 1000 map operations and sums over even larger groups |
Pattern observation: The total work grows roughly in direct proportion to the input size, because each record is mapped once and each surviving value is added exactly once in some reducer.
Time Complexity: O(n)
This means the processing time grows linearly as the data size increases.
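One way to check the linear pattern empirically is to count operations as the input grows (a small Python sketch under the assumption that each map call and each reduce-side addition counts as one operation; the 10-key layout is an arbitrary choice for illustration):

```python
def count_operations(n):
    """Count map calls plus reduce additions for n input records."""
    records = [(i % 10, 1) for i in range(n)]  # n records spread over 10 keys
    ops = 0
    groups = {}
    for key, value in records:          # map phase: one operation per record
        ops += 1
        groups.setdefault(key, []).append(value)
    for key, values in groups.items():  # reduce phase: one operation per value
        for v in values:
            ops += 1
    return ops

for n in (10, 100, 1000):
    print(n, count_operations(n))  # 10 -> 20, 100 -> 200, 1000 -> 2000
```

Doubling n doubles the operation count, matching the table above and the O(n) conclusion.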
[X] Wrong: "The reducer runs a fixed number of times regardless of data size."
[OK] Correct: The reducer runs once per unique key, so if data grows with more keys or values, the reducer work grows too.
Understanding how data lake design patterns affect processing time helps you reason about handling big data efficiently in Hadoop.
"What if we added a combiner step before the reducer? How would the time complexity change?"
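As a starting point for that question: a combiner runs the same summation locally on each mapper's output before the shuffle. The overall time complexity stays O(n), since every record is still touched once, but the volume of data moved between nodes shrinks from one pair per record to one pair per key per mapper. A hedged sketch of the idea (plain Python, with the two mapper outputs invented for illustration):

```python
from collections import defaultdict

def combine(mapper_output):
    """Combiner: pre-sum values per key on the mapper side."""
    partial = defaultdict(int)
    for key, value in mapper_output:
        partial[key] += value
    return list(partial.items())  # one pair per key, not one per record

# Two mappers, each emitting several records for a few keys.
mapper1 = [("a", 1), ("a", 1), ("b", 1)]
mapper2 = [("a", 1), ("b", 1), ("b", 1)]

combined1 = combine(mapper1)  # [('a', 2), ('b', 1)] -- 3 pairs shrunk to 2
combined2 = combine(mapper2)  # [('a', 1), ('b', 2)]

# The reducer now sums the pre-aggregated pairs per key.
final = defaultdict(int)
for key, value in combined1 + combined2:
    final[key] += value
print(dict(final))  # {'a': 3, 'b': 3} -- same result as without the combiner
```

This works because summation is associative and commutative, which is the condition Hadoop requires for a reducer to double as a combiner.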