Iterating over groups in Pandas - Time & Space Complexity
Splitting data into groups and visiting each group one by one takes time. We want to know how that time grows as the dataset gets bigger.
How does the time to go through all groups change when we have more data?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C'],
    'Value': [10, 20, 30, 40, 50]
})

groups = df.groupby('Category')
for name, group in groups:
    print(name, group['Value'].sum())
```
This code splits the data by 'Category' and then goes through each group to add up the 'Value' numbers.
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Looping over each group created by the split.
- How many times: Once for each unique category in the data.
As the data grows, both the number of groups and the size of each group can change. Grouping visits every row once to assign it to a group, and summing visits every row once more, so the total work stays proportional to the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations to group and sum |
| 100 | About 100 operations to group and sum |
| 1000 | About 1000 operations to group and sum |
Pattern observation: The total work grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the time to process grows linearly with the number of rows in the data.
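To see where the linear count comes from, here is a minimal sketch that simulates the split-apply pattern in plain Python and counts row visits (the `group_and_sum` helper is hypothetical, written for this illustration, not part of pandas):

```python
from collections import defaultdict

def group_and_sum(categories, values):
    """Simulate the split-apply pattern, counting row visits."""
    ops = 0
    groups = defaultdict(list)
    for cat, val in zip(categories, values):  # split: one visit per row
        groups[cat].append(val)
        ops += 1
    totals = {}
    for cat, vals in groups.items():          # apply: one more visit per row
        totals[cat] = sum(vals)
        ops += len(vals)
    return totals, ops

for n in (10, 100, 1000):
    cats = [str(i % 3) for i in range(n)]
    vals = list(range(n))
    _, ops = group_and_sum(cats, vals)
    print(n, ops)  # ops == 2 * n: every row is touched once per phase
```

The operation count doubles the row count regardless of how the rows are distributed across groups, which is exactly why the constant factor disappears and the complexity is reported as O(n).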
[X] Wrong: "Grouping data is instant and does not add to the time cost."
[OK] Correct: Grouping requires looking at every row to decide which group it belongs to, so it takes time proportional to the data size.
Understanding how grouping and iterating over groups affects time helps you explain your code choices clearly and shows you know how data size impacts performance.
"What if we replaced the loop over groups with a vectorized aggregation? How would the time complexity change?"
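One way to explore that question: a vectorized aggregation produces the same totals without a Python-level loop. The asymptotic complexity stays O(n), since every row must still be visited, but the constant factor typically drops because the per-group work runs inside pandas' compiled code. A sketch using the same toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C'],
    'Value': [10, 20, 30, 40, 50]
})

# One vectorized call replaces the explicit loop over groups.
totals = df.groupby('Category')['Value'].sum()
print(totals)
```

`totals` is a Series indexed by category, so individual results are still available as `totals['A']`, and the whole result can be joined back to other tables without ever iterating in Python.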