Why Hadoop Was Created for Big Data - Performance Analysis
We want to understand why Hadoop was built to handle big data efficiently: how does it keep up as the amount of data grows, without slowing down too much? To find out, let's analyze the time complexity of a simple Hadoop MapReduce job.
```
// Hadoop MapReduce pseudocode: word count
map(key, value):
    for each word in value:
        emit(word, 1)          // one (word, 1) pair per word seen

reduce(word, counts):
    sum = 0
    for count in counts:       // add up all the 1s for this word
        sum += count
    emit(word, sum)
```
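The pseudocode above can be sketched in plain Python as a single-process simulation (this is illustrative only, not the actual Hadoop API; the function names `map_phase` and `reduce_phase` are made up for this example):

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group the 1s by word, then sum each group
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_phase(map_phase("big data big jobs"))
# word_counts == {"big": 2, "data": 1, "jobs": 1}
```

In real Hadoop, the map and reduce calls run in parallel on many machines, but the total amount of work is the same as in this one-machine sketch.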
This code counts how many times each word appears in a large dataset.
Look at the loops that repeat work:
- Primary operation: The map function loops over every word in the input data.
- How many times: Once for each word in the entire dataset.
- The reduce function loops over all counts for each unique word.
- The dominant operation is the map loop over all words because it processes the full data.
As the data size grows, the number of words grows too.
| Input Size (words) | Approx. Map Operations |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
Pattern observation: The work grows directly with the number of words. Double the words, double the work.
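The "double the words, double the work" claim can be checked with a tiny operation counter (an illustrative sketch; `map_operations` is a made-up helper, not part of Hadoop):

```python
def map_operations(words):
    # Count one unit of work per word the map loop touches
    ops = 0
    for _ in words:
        ops += 1
    return ops

small = ["word"] * 1000
large = ["word"] * 2000

# Doubling the input exactly doubles the number of map operations
assert map_operations(large) == 2 * map_operations(small)
```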
Time Complexity: O(n)
This means the processing time grows linearly with the amount of data: doubling the input roughly doubles the time.
[X] Wrong: "Hadoop processes data instantly no matter how big it is."
[OK] Correct: Hadoop still needs to read and process every piece of data, so more data means more work and more time.
Understanding how Hadoop scales with data size shows you know why distributed systems matter for big data jobs.
"What if the reduce step also had to loop over a very large number of counts for each word? How would that affect the time complexity?"
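As a hint toward that question: each mapped word contributes exactly one count, so summed across all reduce calls the reduce loops do at most n iterations total, and the overall complexity stays O(n). What changes in the extreme case is that one reducer does all the work (data skew). A minimal sketch of that worst case, using Python's `Counter` to stand in for the map and shuffle steps:

```python
from collections import Counter

# Worst case for a single reducer: every word is identical,
# so one reduce call must loop over all n counts.
n = 10_000
words = ["hadoop"] * n

counts = Counter(words)               # map + shuffle collapsed into one step
reduce_iterations = counts["hadoop"]  # the lone reduce loops n times

# Total work is still proportional to n (O(n) overall),
# but it all lands on one reducer instead of being spread out.
```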