Multiple aggregation functions in Pandas - Time & Space Complexity
We want to understand how the time needed to run multiple aggregation functions on data grows as the data gets bigger.
How does adding more data affect the work pandas does when summarizing it?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 20, 30, 40, 50, 60]
})
result = df.groupby('group').agg({'value': ['sum', 'mean', 'max']})
```
This code groups data by a column and calculates sum, mean, and max for each group.
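To make the output concrete, here is the same snippet with the expected per-group values noted in comments (group A holds 10, 30, 50; group B holds 20, 40, 60 — the sums, means, and maxes below are computed by hand from those rows):

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 20, 30, 40, 50, 60]
})
# Columns become a MultiIndex: ('value', 'sum'), ('value', 'mean'), ('value', 'max')
result = df.groupby('group').agg({'value': ['sum', 'mean', 'max']})

# Group A: 10, 30, 50 -> sum 90, mean 30.0, max 50
# Group B: 20, 40, 60 -> sum 120, mean 40.0, max 60
print(result)
```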
- Primary operation: pandas iterates over each group internally and applies each aggregation function to that group's rows.
- How many times: each aggregation function reads every row in its group once, so with k functions every row is processed about k times.
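The bullets above can be sketched in pure Python. This is illustrative only, not how pandas works internally (pandas uses optimized Cython code paths), but it mimics the same grouping and counts how many row reads the three aggregations perform:

```python
# Pure-Python sketch of groupby + multiple aggregations.
# Illustrative only: pandas' real implementation is optimized C/Cython.
groups = ['A', 'B', 'A', 'B', 'A', 'B']
values = [10, 20, 30, 40, 50, 60]

# Step 1: bucket rows by group key (one pass over all n rows).
buckets = {}
for g, v in zip(groups, values):
    buckets.setdefault(g, []).append(v)

# Step 2: apply each aggregation to each bucket, counting row reads.
funcs = {'sum': sum, 'mean': lambda xs: sum(xs) / len(xs), 'max': max}
reads = 0
agg_result = {}
for g, xs in buckets.items():
    agg_result[g] = {}
    for name, fn in funcs.items():
        agg_result[g][name] = fn(xs)
        reads += len(xs)  # each function scans the bucket's rows once

print(agg_result)  # {'A': {'sum': 90, 'mean': 30.0, 'max': 50}, 'B': ...}
print(reads)       # 18 reads = 6 rows x 3 functions
```

Each of the 6 rows is read once per aggregation function, giving 6 × 3 = 18 row reads in total.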
As the number of rows grows, pandas must look at more data for each aggregation.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 (10 rows x 3 functions) |
| 100 | About 300 (100 rows x 3 functions) |
| 1000 | About 3000 (1000 rows x 3 functions) |
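The table's operation counts follow a one-line cost model: approximate operations = rows × number of aggregation functions (a simplification that ignores the constant extra pass that assigns rows to groups):

```python
# Simple cost model: ops ~= n rows x k aggregation functions.
K_FUNCS = 3  # sum, mean, max

ops = {n: n * K_FUNCS for n in (10, 100, 1000)}
for n, count in ops.items():
    print(f"n={n:>5}: ~{count} operations")
# n=   10: ~30 operations
# n=  100: ~300 operations
# n= 1000: ~3000 operations
```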
Pattern observation: The work grows in direct proportion to the number of rows (n) and to the number of aggregation functions (k).
Time Complexity: O(n × k), which simplifies to O(n) when the number of aggregation functions is a fixed constant.
This means the time to run these aggregations grows linearly with the number of rows in the data.
[X] Wrong: "Adding more aggregation functions multiplies the time by the square of the data size."
[OK] Correct: Each aggregation looks at the data once, so time grows with data size times number of functions, not squared.
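The distinction can be checked with the same cost model: doubling the number of aggregation functions doubles the estimated work, while O(n²) growth would be dramatically larger:

```python
n = 1000

ops_3_funcs = n * 3   # 3,000: three aggregations, each one pass over the rows
ops_6_funcs = n * 6   # 6,000: doubling the functions doubles the work
quadratic = n * n     # 1,000,000: what O(n^2) growth would look like

print(ops_3_funcs, ops_6_funcs, quadratic)
```

Adding functions scales the cost multiplicatively by k, not by squaring n.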
Knowing how aggregation time grows helps you explain performance when working with grouped data, a common task in data analysis.
"What if we added more grouping columns? How would the time complexity change?"