When to use Hadoop in modern data stacks - Time & Space Complexity
We want to understand how the running time of a Hadoop job grows as the input data grows.
This helps us decide when Hadoop is a good fit in a modern data stack.
Let's analyze the time complexity of a simple Hadoop MapReduce job that counts words.
```
// Mapper function
map(key, value) {
    for each word in value {
        emit(word, 1);
    }
}

// Reducer function
reduce(word, counts) {
    sum = 0;
    for each count in counts {
        sum += count;
    }
    emit(word, sum);
}
```
This job reads text data, counts how many times each word appears, and outputs totals.
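The pseudocode above can be sketched as a runnable single-machine simulation in Python. This is illustrative only: the function names and the explicit shuffle step are assumptions for the sketch, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as Hadoop's shuffle step does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts for each unique word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
totals = reduce_phase(shuffle(map_phase(lines)))
print(totals["the"])  # "the" appears three times, so this prints 3
```

On a real cluster the map and reduce phases run on different machines, but the loop structure, and therefore the operation count, is the same.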
Look at the loops that repeat work:
- Primary operation: The mapper loops over every word in the input data.
- How many times: Once for each word in the entire dataset.
- Secondary operation: The reducer loops over counts for each unique word.
- How many times: Once for each occurrence of that word.
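The two loop counts above can be made concrete with a small counting sketch (the helper name is hypothetical):

```python
from collections import Counter

def operation_counts(lines):
    """Count mapper and reducer operations for the word-count job."""
    words = [w for line in lines for w in line.split()]
    map_ops = len(words)                 # mapper emits once per word in the dataset
    per_word = Counter(words)            # reducer loops once per occurrence of each word
    reduce_ops = sum(per_word.values())  # summed over all words, this is n again
    return map_ops, reduce_ops

print(operation_counts(["the quick brown fox", "the lazy dog"]))  # (7, 7)
```

Note that the reducer's per-word loops add up to the same total n, so the two phases together still do work proportional to the number of words.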
As the input data grows, the number of words grows roughly in proportion.
| Input Size (n words) | Approx. Operations |
|---|---|
| 10 | About 10 map operations + reduce sums |
| 100 | About 100 map operations + reduce sums |
| 1000 | About 1000 map operations + reduce sums |
Pattern observation: The work grows roughly in direct proportion to the number of words.
Time Complexity: O(n)
This means the time to run the job grows linearly with the size of the input data.
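The linear pattern in the table can be checked directly by counting mapper emissions for growing inputs (a minimal sketch, not actual Hadoop):

```python
def map_emissions(lines):
    """Count how many (word, 1) pairs the mapper would emit."""
    return sum(len(line.split()) for line in lines)

for n in (10, 100, 1000):
    data = ["word"] * n  # n one-word lines, so n words total
    print(n, map_emissions(data))  # emissions equal n at every scale
```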
[X] Wrong: "Hadoop always runs fast regardless of data size because it is distributed."
[OK] Correct: Hadoop splits work across machines, but total work still grows with data size, so bigger data means longer total processing time.
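The correction above can be sketched numerically: splitting n words across k machines shrinks each machine's share to about n/k, but the total work across the cluster is still n. The round-robin split here is an assumed toy scheduler, not how Hadoop actually partitions input.

```python
def split_work(words, k):
    """Assign words round-robin to k machines (a hypothetical scheduler)."""
    shards = [words[i::k] for i in range(k)]
    per_machine = [len(shard) for shard in shards]
    return per_machine, sum(per_machine)

words = ["w"] * 1000
per_machine, total = split_work(words, 4)
print(per_machine, total)  # [250, 250, 250, 250] 1000
```

Wall-clock time can drop with more machines, but the total number of operations, and thus cost, still grows linearly with the data.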
Understanding how Hadoop scales helps you explain when to choose it for big data tasks versus faster tools for smaller data.
"What if we changed the job to run only on a sample of the data? How would the time complexity change?"
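One way to reason about that question: if the job keeps only a fraction p of the lines (a hypothetical sampled variant, sketched below), the counting work drops to roughly p times n operations, which is still O(n). Only a fixed-size sample would make the counting step independent of n, and even then, drawing the sample may require a pass over the data.

```python
import random

def sampled_word_count(lines, p, seed=0):
    """Count words over a random fraction p of the input lines."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sample = [line for line in lines if rng.random() < p]
    counts = {}
    for line in sample:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```

With p = 1.0 this degenerates to the full job; with smaller p the constant factor shrinks but the growth rate stays linear.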