# Input Splits and Data Locality in Hadoop: Time & Space Complexity
When Hadoop processes a large dataset, it first breaks the data into pieces called input splits; each split is handed to one map task. We want to know how the processing time grows as the data size grows, so let's analyze the time complexity of splitting the input and processing it with data locality.
```
// Pseudocode for input split processing
for each inputSplit in inputSplits:
    if data is local on this node:
        process inputSplit
    else:
        fetch data remotely
        process inputSplit
```
This loop processes each split in turn, preferring node-local data so it can skip the cost of a network transfer.
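The pseudocode above can be sketched as a small runnable Python simulation. The split list, locality flags, and `fetch_remote` helper are illustrative assumptions for this sketch, not Hadoop's actual API:

```python
# Sketch: process a list of input splits, preferring local data.
# Split IDs and locality flags are made-up illustration values.
splits = [
    {"id": 0, "local": True},
    {"id": 1, "local": False},
    {"id": 2, "local": True},
]

def fetch_remote(split):
    # Placeholder for a network transfer; in real Hadoop this is
    # the expensive step that data locality tries to avoid.
    return split

processed = 0
remote_fetches = 0
for split in splits:
    if not split["local"]:
        fetch_remote(split)
        remote_fetches += 1
    # Process the split: one unit of work, local or not.
    processed += 1

print(processed, remote_fetches)  # 3 1 -- every split is processed exactly once
```

Note that locality changes how many remote fetches happen, but the number of processing steps is always the number of splits.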
Look for repeated actions in the code.
- Primary operation: loop over all input splits and process each one.
- How many times: once per input split; the number of splits grows with the data size.
As data size grows, the number of input splits grows roughly in proportion.
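As a rough illustration of that proportionality, the number of splits is about the file size divided by the split size. This sketch assumes a 128 MB split size (HDFS's default block size in recent versions); the function name is made up for illustration:

```python
import math

def num_splits(file_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # Roughly one split per block-sized chunk of the file.
    return math.ceil(file_size_bytes / split_size_bytes)

# Doubling the input roughly doubles the split count:
print(num_splits(1 * 1024**3))  # 1 GiB -> 8 splits
print(num_splits(2 * 1024**3))  # 2 GiB -> 16 splits
```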
| Number of Splits (n) | Approx. Processing Operations |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1,000 | 1,000 |
Pattern observation: The work grows linearly as the number of splits grows.
Time Complexity: O(n)
This means the time to process grows directly with the number of input splits.
[X] Wrong: "Data locality makes processing time constant no matter the data size."
[OK] Correct: Data locality reduces the cost per split by avoiding network transfers, but every split must still be processed, so total time still grows with data size.
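A toy cost model makes the distinction concrete: locality shrinks the constant factor, not the growth rate. The per-split cost numbers below are invented purely for illustration:

```python
LOCAL_COST = 1    # assumed cost units to process a node-local split
REMOTE_COST = 5   # assumed cost units when the data must be fetched

def total_cost(n_splits, all_local):
    # Every split is processed once; locality only changes the unit cost.
    per_split = LOCAL_COST if all_local else REMOTE_COST
    return n_splits * per_split

# Locality shrinks the constant factor...
print(total_cost(100, all_local=True))    # 100
print(total_cost(100, all_local=False))   # 500
# ...but both cases grow linearly: 10x the splits is 10x the cost.
print(total_cost(1000, all_local=True))   # 1000
```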
Understanding how input splits and data locality affect processing time helps you explain Hadoop's efficiency clearly.
What if the input splits were uneven in size? How would that affect the time complexity?
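One way to reason about this question: with parallel workers, total work is still O(n), but the finish time is driven by the largest split, so a skewed split can become a straggler. A small greedy-scheduling sketch (the worker count, split sizes, and `makespan` helper are assumed values for illustration):

```python
import heapq

def makespan(split_sizes, n_workers):
    # Greedy scheduling: the next split goes to whichever worker frees up first.
    workers = [0] * n_workers          # finish time of each worker
    heapq.heapify(workers)
    for size in split_sizes:
        finish = heapq.heappop(workers)
        heapq.heappush(workers, finish + size)
    return max(workers)

even = [10] * 8                        # 8 equal splits, total work 80
skewed = [59] + [3] * 7                # same total work (80), one huge split first
print(makespan(even, 4))    # 20
print(makespan(skewed, 4))  # 59: the one big split sets the finish time
```

Even though both workloads contain the same total work, the skewed one finishes much later because every other worker sits idle while the straggler runs.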