Pig Latin basics in Hadoop - Time & Space Complexity
We want to understand how the time to run Pig Latin scripts changes as the data size grows.
How does the script's work increase when we have more data?
Analyze the time complexity of the following Pig Latin script.
```pig
data = LOAD 'input' AS (name:chararray, age:int);
adults = FILTER data BY age >= 18;
grouped = GROUP adults BY name;
counts = FOREACH grouped GENERATE group, COUNT(adults);
STORE counts INTO 'output';
```
This script loads data, filters adults, groups by name, counts adults per name, and stores the result.
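To make the per-row work visible, here is a hypothetical in-memory Python analogue of the same pipeline. This is only a sketch of the logic; Pig actually runs these steps as MapReduce jobs, and the sample rows are invented for illustration:

```python
# Hypothetical in-memory analogue of the Pig script above (not how Pig executes it).
from collections import defaultdict

data = [("alice", 30), ("bob", 15), ("alice", 42), ("carol", 19)]

# FILTER data BY age >= 18  -- one pass over all rows
adults = [(name, age) for name, age in data if age >= 18]

# GROUP adults BY name      -- another pass over the filtered rows
grouped = defaultdict(list)
for name, age in adults:
    grouped[name].append(age)

# FOREACH grouped GENERATE group, COUNT(adults)
counts = {name: len(ages) for name, ages in grouped.items()}
print(counts)  # {'alice': 2, 'carol': 1}
```

Each Pig statement corresponds to one pass over the rows it receives, which is the key fact for the complexity analysis below.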
Identify the loops, recursion, or repeated traversals. In Pig Latin these are implicit: each relational operator scans the rows it receives.
- Primary operation: Scanning all rows to filter and group data.
- How many times: Each row is processed once during filtering and once during grouping.
As the number of rows n grows, each row passes through each step (filtering, then grouping) once, so the total work is a small constant number of operations per row.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 operations (filter + group) |
| 100 | About 200 operations |
| 1000 | About 2000 operations |
Pattern observation: Operations grow roughly in direct proportion to input size.
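The table's pattern can be checked with a rough counting model: one operation per row in the filter pass plus one per row in the group pass, so about 2n operations in total. This is a simplified model for illustration, not a measurement of Pig itself:

```python
def approx_ops(n):
    """Rough model: one touch per row in the filter pass, one in the group pass."""
    return n + n

for n in (10, 100, 1000):
    print(n, approx_ops(n))
# 10 20
# 100 200
# 1000 2000
```

Doubling the input doubles the operation count, which is exactly the linear pattern in the table.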
Time Complexity: O(n)
This means the time to run the script grows linearly as the data size grows.
[X] Wrong: "Grouping data takes constant time no matter how big the data is."
[OK] Correct: Grouping must look at each row to organize it, so it takes more time as data grows.
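One way to convince yourself is to instrument a toy group-by with a counter and watch the number of rows it touches track the input size. This is a sketch of the idea only, not Pig's actual shuffle implementation:

```python
from collections import defaultdict

def group_by_name(rows):
    """Toy group-by that records how many rows it must examine."""
    touches = 0
    groups = defaultdict(list)
    for name, age in rows:
        touches += 1           # every row is examined exactly once
        groups[name].append(age)
    return groups, touches

small = [("a", 20)] * 10
large = [("a", 20)] * 1000
assert group_by_name(small)[1] == 10
assert group_by_name(large)[1] == 1000   # touch count grows with the data
```

The touch count scales with the number of rows, so grouping cannot be constant time.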
Understanding how data size affects Pig Latin scripts helps you explain your approach clearly and shows you know how big data tools work.
"What if we added a nested FOREACH inside the grouping step? How would the time complexity change?"