Single and multiple column grouping in Data Analysis Python - Time & Space Complexity
When we group data by one or more columns, the computer organizes rows into sets. We want to know how the time to do this changes as the data grows.
How does grouping time grow when we add more rows or columns?
Analyze the time complexity of the following code snippet.
import pandas as pd

data = pd.DataFrame({
    'City': ['NY', 'LA', 'NY', 'LA', 'NY'],
    'Year': [2020, 2020, 2021, 2021, 2020],
    'Sales': [100, 200, 150, 250, 300]
})

# Group rows by the (City, Year) pair, then sum the remaining column
result = data.groupby(['City', 'Year']).sum()
This code groups sales data by city and year, then sums sales in each group.
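To see how the choice of keys shapes the result, here is a small sketch that groups the same data first by one column and then by two. The variable names `by_city` and `by_city_year` are illustrative and not part of the original snippet.

```python
import pandas as pd

data = pd.DataFrame({
    'City': ['NY', 'LA', 'NY', 'LA', 'NY'],
    'Year': [2020, 2020, 2021, 2021, 2020],
    'Sales': [100, 200, 150, 250, 300]
})

# Single-column grouping: one group per distinct city.
by_city = data.groupby('City')['Sales'].sum()

# Multiple-column grouping: one group per distinct (city, year) pair.
by_city_year = data.groupby(['City', 'Year'])['Sales'].sum()

print(by_city)
print(by_city_year)
```

Grouping by City alone yields two groups (LA: 450, NY: 550). Adding Year splits them into four (city, year) groups.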
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Scanning each row once to assign it to a group.
- How many times: Exactly once per row in the data.
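The single pass described above can be sketched in plain Python with a dictionary. This only illustrates the idea and is not how pandas is implemented internally. The function name `group_and_sum` and the `visits` counter are hypothetical helpers added for the example.

```python
from collections import defaultdict

# Same rows as the DataFrame above, written as plain tuples: (city, year, sales).
rows = [
    ('NY', 2020, 100),
    ('LA', 2020, 200),
    ('NY', 2021, 150),
    ('LA', 2021, 250),
    ('NY', 2020, 300),
]

def group_and_sum(rows):
    """Sum sales per (city, year) key in one pass over the rows."""
    totals = defaultdict(int)
    visits = 0
    for city, year, sales in rows:
        visits += 1                      # each row is touched exactly once
        totals[(city, year)] += sales    # dictionary update, O(1) on average
    return dict(totals), visits

totals, visits = group_and_sum(rows)
print(totals)
print(visits)
```

Because every row triggers one dictionary update, the number of visits equals the number of rows, which is where the linear growth comes from.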
As the number of rows grows, the grouping step must look at each row once to decide its group.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: The number of operations grows directly with the number of rows.
Time Complexity: O(n)
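This means the time to group grows in a straight line with the number of rows. By default pandas also sorts the group keys (`sort=True`), but that extra cost depends on the number of distinct groups, not directly on the number of rows.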
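One way to check the linear trend yourself is to time the grouping on synthetic data of increasing size. The sketch below is a rough measurement, not a benchmark: the helper `time_groupby`, the city labels, the year range, and the row counts are all assumptions chosen for illustration, and the actual timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

def time_groupby(n, seed=0):
    """Return the seconds taken to group n synthetic rows by two columns."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'City': rng.choice(['NY', 'LA', 'SF', 'CHI'], size=n),
        'Year': rng.integers(2018, 2024, size=n),
        'Sales': rng.integers(1, 1000, size=n),
    })
    start = time.perf_counter()
    df.groupby(['City', 'Year'])['Sales'].sum()
    return time.perf_counter() - start

# Each step multiplies the row count by 10.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} rows: {time_groupby(n):.4f} s")
```

If the cost is linear, each tenfold increase in rows should raise the time by roughly a factor of ten once the data is large enough that fixed overhead no longer dominates.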
[X] Wrong: "Grouping by more columns makes the time grow much faster, like squared or worse."
[OK] Correct: Grouping still looks at each row once. Each extra key column adds a little work per row and more memory for the grouping keys, so with k key columns the cost is roughly O(n · k), which is still linear in the number of rows.
Understanding how grouping scales helps you explain data processing steps clearly. It shows you can think about how data size affects performance, a useful skill in real projects.
"What if we grouped the data multiple times in a loop? How would the time complexity change?"