Grouping by multiple columns in Pandas - Time & Space Complexity
When we group data by multiple columns, we want to see how the time to do this grows as the data gets bigger.
We ask: How does the work increase when we add more rows or more groups?
Analyze the time complexity of the following code snippet.
import pandas as pd
data = pd.DataFrame({
    'City': ['NY', 'LA', 'NY', 'LA', 'NY'],
    'Year': [2020, 2020, 2021, 2021, 2020],
    'Sales': [100, 200, 150, 250, 300]
})
grouped = data.groupby(['City', 'Year']).sum()
This code groups sales data by both city and year, then sums the sales in each group.
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Scanning each row to find its group based on city and year.
- How many times: Once for each row in the data.
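The per-row work described above can be sketched in plain Python. This is only an illustration of the idea (one pass over the rows, one dictionary lookup per row); it is not how pandas is implemented internally, since pandas does this work in optimized compiled code. The `rows` list and the `group_and_sum` helper are hypothetical names introduced for this sketch.

```python
from collections import defaultdict

# Same rows as the DataFrame above, written as plain (city, year, sales) tuples.
rows = [
    ("NY", 2020, 100),
    ("LA", 2020, 200),
    ("NY", 2021, 150),
    ("LA", 2021, 250),
    ("NY", 2020, 300),
]

def group_and_sum(rows):
    """Illustrative single-pass grouping (hypothetical helper, not pandas internals)."""
    totals = defaultdict(int)
    for city, year, sales in rows:       # one visit per row: n iterations
        totals[(city, year)] += sales    # average O(1) dictionary lookup per row
    return dict(totals)

totals = group_and_sum(rows)
print(totals)
```

The loop body runs once per row and each dictionary update takes constant time on average, which is where the linear growth comes from.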
As the number of rows grows, the time to group and sum grows roughly in direct proportion to the row count.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations to assign groups and sum |
| 100 | About 100 operations |
| 1000 | About 1000 operations |
Pattern observation: Doubling the rows roughly doubles the work needed.
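One rough way to check this pattern is to time the same groupby on frames of increasing size. The sketch below assumes randomly generated data; `make_frame` and `time_groupby` are hypothetical helpers, and the extra city names and year range are made up for the test data. Timings vary by machine and are noisy for small inputs, so treat the output as a trend, not an exact measurement.

```python
import time

import numpy as np
import pandas as pd

def make_frame(n, seed=0):
    """Build a random sales table with n rows (synthetic data, for timing only)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "City": rng.choice(["NY", "LA", "SF", "CHI"], size=n),
        "Year": rng.integers(2015, 2025, size=n),   # years 2015..2024
        "Sales": rng.integers(1, 1000, size=n),
    })

def time_groupby(n):
    """Return (elapsed seconds, grouped result) for a two-column groupby on n rows."""
    df = make_frame(n)
    start = time.perf_counter()
    result = df.groupby(["City", "Year"])["Sales"].sum()
    elapsed = time.perf_counter() - start
    return elapsed, result

for n in (100_000, 200_000, 400_000):
    elapsed, result = time_groupby(n)
    print(f"n={n:>7}: {elapsed:.4f} s, {len(result)} groups")
```

If the growth is linear, each doubling of `n` should roughly double the elapsed time, while the number of groups stays capped at 40 (4 cities times 10 years).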
Time Complexity: O(n)
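This means the time to group by multiple columns grows linearly with the number of rows. By default, `groupby` also sorts the group keys (`sort=True`), which adds a cost that depends on the number of distinct groups; when there are far fewer groups than rows, this extra step is small compared with the pass over the rows.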
[X] Wrong: "Grouping by more columns makes the operation take much longer than just one column, like multiplying time by the number of columns."
[OK] Correct: Grouping time depends mostly on the number of rows. Each extra grouping column adds some work per row, because one more value has to be examined to identify the group, but the number of grouping columns is small and fixed. The growth with respect to the number of rows therefore stays linear.
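A quick experiment can make this concrete by timing a one-column and a two-column groupby on the same frame. This is a sketch under assumptions: the data is synthetic, `timed` is a hypothetical helper, and the measured ratio depends on the machine, the pandas version, and the column types.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data for timing only; city names and year range are made up.
rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "City": rng.choice(["NY", "LA", "SF", "CHI"], size=n),
    "Year": rng.integers(2015, 2025, size=n),
    "Sales": rng.integers(1, 1000, size=n),
})

def timed(keys):
    """Return (elapsed seconds, grouped sums) for grouping df by the given keys."""
    start = time.perf_counter()
    out = df.groupby(keys)["Sales"].sum()
    return time.perf_counter() - start, out

t_one, by_city = timed(["City"])
t_two, by_city_year = timed(["City", "Year"])

print(f"one key : {t_one:.4f} s, {len(by_city)} groups")
print(f"two keys: {t_two:.4f} s, {len(by_city_year)} groups")
```

Both calls make a pass over the same 200,000 rows. The two-key version may take somewhat longer, but rerunning the script with a larger `n` should show both timings growing in step with the row count.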
Understanding how grouping scales helps you explain data processing steps clearly and shows you can think about efficiency in real tasks.
"What if we grouped by only one column instead of two? How would the time complexity change?"