Data aggregation reporting in Pandas - Time & Space Complexity
When we use pandas to summarize data, for example by computing averages or totals, the work takes time.
We want to know how that time changes as the data grows.
Analyze the time complexity of the following code snippet.
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A'],
    'Value': [10, 20, 30, 40, 50, 60]
})

result = data.groupby('Category').agg({'Value': 'sum'})
This code groups data by categories and sums the values in each group.
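Identify the loops, recursion, or array traversals that repeat.
- Primary operation: pandas makes an internal pass over the rows, mapping each Category value to a group (typically through a hash table) and adding the row's Value to that group's running total.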
- How many times: Once for each row in the data (n times).
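The single pass described above can be modelled in plain Python. This is only a conceptual sketch: pandas does this work in compiled code, and the helper name group_sum and the steps counter are made up here for illustration. It shows one dictionary lookup and one addition per row, so the step count equals the number of rows.

```python
from collections import defaultdict

# Same sample data as the pandas snippet, as plain Python lists.
categories = ['A', 'B', 'A', 'B', 'C', 'A']
values = [10, 20, 30, 40, 50, 60]

def group_sum(keys, vals):
    """Conceptual model of groupby-sum (hypothetical helper, not pandas internals)."""
    totals = defaultdict(int)
    steps = 0
    for key, val in zip(keys, vals):
        totals[key] += val   # one hash lookup and one addition per row
        steps += 1           # count one unit of work for this row
    return dict(totals), steps

totals, steps = group_sum(categories, values)
print(totals)  # {'A': 100, 'B': 60, 'C': 50}
print(steps)   # 6
```

Six rows give six steps; a list of a million rows would give a million steps, which is the linear growth this analysis describes.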
As the number of rows grows, the time to group and sum grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations to assign and sum |
| 100 | About 100 operations |
| 1000 | About 1000 operations |
Pattern observation: Doubling the data roughly doubles the work needed.
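One way to check the doubling pattern is to time the same groupby on synthetic data of increasing size. The sketch below makes several assumptions that are not in the original example: the helper name time_groupby_sum, integer category labels, ten categories, and the chosen row counts. Timings depend on the machine and pandas version, so no expected output is shown.

```python
import time

import numpy as np
import pandas as pd

def time_groupby_sum(n_rows, n_categories=10, seed=0):
    """Return the seconds taken by one groupby-sum over n_rows synthetic rows."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        'Category': rng.integers(0, n_categories, size=n_rows),
        'Value': rng.integers(0, 100, size=n_rows),
    })
    start = time.perf_counter()
    frame.groupby('Category').agg({'Value': 'sum'})
    return time.perf_counter() - start

# Each size is double the previous one; roughly doubling times suggest O(n).
for n in (100_000, 200_000, 400_000, 800_000):
    print(f"{n:>8} rows: {time_groupby_sum(n):.4f} s")
```

Large sizes are used on purpose. With only 10 or 100 rows, fixed overhead inside pandas tends to dominate the measurement, so the linear trend usually becomes visible only at larger n.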
Time Complexity: O(n)
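This means the time to group and sum grows linearly with the number of rows, on average, assuming the hash-based grouping pandas normally uses. By default groupby also sorts the group keys, which adds roughly O(g log g) work for g distinct groups; this is usually negligible when g is much smaller than n.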
[X] Wrong: "Grouping data takes the same time no matter how many rows there are."
[OK] Correct: The operation must look at each row to decide its group, so more rows mean more work.
Understanding how grouping scales helps you explain your code choices clearly and shows you know how data size affects performance.
"What if we grouped by two columns instead of one? How would the time complexity change?"
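As a starting point for exploring that question, here is a minimal sketch that groups the same data by two columns. The Region column and its values are hypothetical, added only for illustration. Each row is still visited once, so the work should stay roughly linear in n, although building a combined key per row costs a little more and the number of groups can grow.

```python
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A'],
    'Region': ['X', 'X', 'Y', 'X', 'Y', 'X'],  # hypothetical second key column
    'Value': [10, 20, 30, 40, 50, 60],
})

# Passing a list of column names groups by each distinct (Category, Region) pair.
result = data.groupby(['Category', 'Region']).agg({'Value': 'sum'})
print(result)
```

With this data there are four distinct pairs: (A, X) sums to 70, (A, Y) to 30, (B, X) to 60, and (C, Y) to 50.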