LOAD, FILTER, and STORE operations in Hadoop - Time & Space Complexity
We want to understand how running time changes as we load, filter, and store data in Hadoop.
How does the size of the input affect the work each of these operations performs?
Analyze the time complexity of the following Pig Latin snippet:

```pig
data = LOAD 'input_data';
filtered = FILTER data BY some_condition;
STORE filtered INTO 'output_data';
```
This code loads data, filters rows based on a condition, then stores the filtered data.
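The same pipeline can be sketched in plain Python, as a simplified stand-in for Pig's distributed execution (the function names here are illustrative, not Hadoop APIs), to make the per-row work visible:

```python
def load(rows):
    # Stand-in for LOAD: materialize the input rows.
    return list(rows)

def filter_rows(rows, condition):
    # Stand-in for FILTER: evaluate the condition once per row,
    # counting how many checks are performed.
    kept = []
    checks = 0
    for row in rows:
        checks += 1
        if condition(row):
            kept.append(row)
    return kept, checks

def store(rows):
    # Stand-in for STORE: report how many rows were written out.
    return len(rows)

data = load(range(100))
filtered, checks = filter_rows(data, lambda r: r % 2 == 0)
written = store(filtered)
print(checks)   # 100 condition checks for 100 input rows
print(written)  # 50 rows survive the filter and are stored
```

The key point the sketch makes concrete: the filter touches every loaded row exactly once, so the number of condition checks equals the number of input rows.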
Look at what repeats as data size grows.
- Primary operation: Filtering each row to check the condition.
- How many times: Once for every row in the loaded data.
As the number of rows increases, the filtering work grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 filter checks |
| 100 | About 100 filter checks |
| 1000 | About 1000 filter checks |
Pattern observation: The work grows directly with the number of rows.
Time Complexity: O(n)
This means the time grows in a straight line with the number of rows processed.
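One way to confirm the pattern in the table above is to count condition evaluations for growing inputs; this small Python sketch (illustrative only, not how Hadoop measures work) shows the counts growing in lockstep with n:

```python
def count_filter_checks(n, condition):
    # Evaluate the filter condition once per row and count evaluations.
    checks = 0
    for row in range(n):
        checks += 1
        condition(row)
    return checks

sizes = [10, 100, 1000]
results = [count_filter_checks(n, lambda r: r % 2 == 0) for n in sizes]
print(results)  # [10, 100, 1000] -- checks grow one-for-one with input size
```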
[X] Wrong: "Filtering only takes constant time no matter how much data there is."
[OK] Correct: Each row must be checked, so more rows mean more work.
Understanding how data size affects filtering helps you explain how Hadoop handles big data efficiently.
"What if we added a nested loop inside the filter condition? How would the time complexity change?"
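As one way to explore that question: if the condition itself had to scan the whole dataset for each row (a hypothetical condition, sketched below in Python), every one of the n rows would cost n inner steps, and the total work would grow as O(n²) instead of O(n):

```python
def filter_with_nested_scan(rows):
    # Hypothetical condition that scans all rows for each row:
    # "keep a row if its value appears more than once in the data".
    steps = 0
    kept = []
    for row in rows:
        count = 0
        for other in rows:  # inner loop: n steps per outer row
            steps += 1
            if other == row:
                count += 1
        if count > 1:
            kept.append(row)
    return kept, steps

rows = [1, 2, 2, 3] * 5  # 20 rows
kept, steps = filter_with_nested_scan(rows)
print(steps)  # 20 * 20 = 400 steps: quadratic, not linear
```

Doubling the input here quadruples the work, which is the signature of O(n²) growth.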