Hadoop vs Spark: Performance Comparison
We want to understand how Hadoop and Spark handle growing data sizes in terms of time taken.
Specifically: how does the time to process the data increase as the input gets bigger?
Analyze the time complexity of a simple word count job in Hadoop MapReduce.
```
// Hadoop MapReduce word count (pseudocode)
map(String key, String value) {
    // value is one chunk of input text
    for (word : value.split()) {
        emit(word, 1);            // one (word, 1) pair per occurrence
    }
}

reduce(String word, Iterator counts) {
    // counts holds every 1 emitted for this word
    int sum = 0;
    for (int count : counts) {
        sum += count;
    }
    emit(word, sum);              // total occurrences of the word
}
```
This code counts how many times each word appears in a large text dataset using Hadoop.
Look at the loops that repeat work:
- Primary operation: The map function loops over every word in the input data.
- How many times: Once for each word in the dataset, so it depends on total words.
- The reduce function loops over counts for each unique word, summing them up.
- The dominant work is reading and processing every word once in the map step.
As the number of words grows, the time to process grows roughly the same way.
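The map-shuffle-reduce flow above can be sketched as a small runnable Python example (function names like `map_phase` and `shuffle` are illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(text):
    # Emit a (word, 1) pair for every word: O(n) over n total words.
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    # Group the emitted 1s by word, as Hadoop's shuffle step does.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each unique word.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase("to be or not to be")))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Every word passes through `map_phase` exactly once, which is where the linear growth comes from.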
| Input Size (words) | Approx. Operations |
|---|---|
| 10 | About 10 map operations + reduce sums |
| 100 | About 100 map operations + reduce sums |
| 1000 | About 1000 map operations + reduce sums |
Pattern observation: Time grows roughly in direct proportion to the number of words.
Time Complexity: O(n)
This means the time to process grows linearly with the size of the input data.
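One way to see this linearity concretely is to count the operations a word count performs on inputs of different sizes. This is a toy simulation on one machine, not a real cluster benchmark:

```python
import random

def word_count_ops(words):
    # Count one operation per word processed (map emit + reduce add).
    ops = 0
    counts = {}
    for word in words:
        ops += 1
        counts[word] = counts.get(word, 0) + 1
    return ops, counts

for n in (10, 100, 1000):
    words = [random.choice("abc") for _ in range(n)]
    ops, _ = word_count_ops(words)
    print(n, ops)  # ops grows in lockstep with n: O(n)
```

Doubling the input doubles the operation count, matching the table above.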
[X] Wrong: "Spark is always faster than Hadoop because it uses memory."
[OK] Correct: Spark can be faster for many workloads, but when the data is too large to fit in memory, Hadoop's disk-based approach can be more stable and predictable.
Understanding how Hadoop and Spark scale with data size helps you explain trade-offs clearly and shows you know how big data tools work under the hood.
"What if we changed the Hadoop job to use Spark's in-memory caching? How would the time complexity change?"
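As a rough way to reason about that question: a single word count pass is still O(n), so the time complexity class does not change. Caching pays off when the same dataset is processed repeatedly, because later jobs skip the disk read. A toy Python sketch (a plain list stands in for Spark's cached dataset; this is not the real Spark API):

```python
def load_from_disk():
    # Stand-in for reading the dataset from storage (slow in real life).
    return "to be or not to be".split()

def count_words(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hadoop-style: every job re-reads the data from disk.
job1 = count_words(load_from_disk())
job2 = count_words(load_from_disk())

# Spark-style: load once, keep in memory, reuse for later jobs.
cached = load_from_disk()
job3 = count_words(cached)
job4 = count_words(cached)

# Each pass is still O(n); caching removes the repeated disk reads,
# so the per-job constant factor drops, not the complexity class.
```

In interview terms: caching changes the constants (disk I/O per job), not the asymptotic O(n) of a single pass over the data.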