Aggregation with agg() in Pandas - Time & Space Complexity
We want to understand how the time needed to aggregate data grows as the data gets bigger.
How does using agg() on a DataFrame scale with more rows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 30, 40, 50]
})

# Group rows by 'group', then compute the sum and mean of 'value' per group
result = data.groupby('group').agg({'value': ['sum', 'mean']})
```
This code groups data by the 'group' column and calculates the sum and mean of 'value' for each group.
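Running the snippet end to end shows the shape of the output: one row per group, with a MultiIndex column for each requested aggregation.

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 30, 40, 50]
})
result = data.groupby('group').agg({'value': ['sum', 'mean']})

# One row per group, MultiIndex columns ('value', 'sum') and ('value', 'mean'):
#   group A -> sum 90, mean 30.0
#   group B -> sum 60, mean 30.0
print(result)
```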
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Iterating over each row to assign it to a group and then aggregating values per group.
- How many times: Once for each row in the DataFrame during grouping, then once per group for aggregation.
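A plain-Python sketch makes those two passes explicit. This is a conceptual model only, not how pandas actually implements `groupby()` (which uses vectorized C code), but it shows the same counting: one iteration per row, then one per group.

```python
# Conceptual model of groupby().agg(): bucket each row by its group key,
# then aggregate each bucket.
rows = [('A', 10), ('B', 20), ('A', 30), ('B', 40), ('A', 50)]

buckets = {}
for group, value in rows:              # n iterations: one per row
    buckets.setdefault(group, []).append(value)

result = {}
for group, values in buckets.items():  # one iteration per group
    result[group] = {'sum': sum(values), 'mean': sum(values) / len(values)}
```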
As the number of rows grows, the time to group and aggregate grows roughly in proportion to the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations to assign groups + aggregation |
| 100 | About 100 operations to assign groups + aggregation |
| 1000 | About 1000 operations to assign groups + aggregation |
Pattern observation: The operations grow roughly linearly as the data size increases.
Time Complexity: O(n)
This means the time to run agg() grows roughly in direct proportion to the number of rows.
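An informal way to see the linear trend is to time `agg()` at a few sizes. Absolute numbers will vary by machine; only the rough proportionality matters (expect the time to grow by about 10x when n grows by 10x).

```python
import time

import numpy as np
import pandas as pd

for n in (10_000, 100_000, 1_000_000):
    df = pd.DataFrame({
        'group': np.random.choice(['A', 'B', 'C'], size=n),
        'value': np.random.rand(n),
    })
    start = time.perf_counter()
    df.groupby('group').agg({'value': ['sum', 'mean']})
    elapsed = time.perf_counter() - start
    print(f"n={n:>9,}: {elapsed:.4f}s")
```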
[X] Wrong: "Aggregation with agg() takes constant time no matter how big the data is."
[OK] Correct: The function must look at each row to group and calculate, so more rows mean more work.
Understanding how aggregation scales helps you explain data processing speed clearly and shows you know how data size affects performance.
"What if we added multiple columns to aggregate with agg()? How would the time complexity change?"
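One way to explore that question: each extra column or aggregation function adds roughly one more linear pass over the rows, so with k aggregations the work is on the order of O(n x k). Since k is a small constant, the time stays linear in the number of rows. A small sketch, extending the original example with a hypothetical extra `score` column:

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 30, 40, 50],
    'score': [1, 2, 3, 4, 5],  # hypothetical extra column for illustration
})

# Two columns, three aggregation functions total: still one grouping pass
# over n rows, with each aggregation adding a linear amount of work.
result = data.groupby('group').agg({
    'value': ['sum', 'mean'],
    'score': ['max'],
})
```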