filter() for group-level filtering in Pandas - Time & Space Complexity
We want to understand how the time needed changes when we use filter() on groups in pandas.
Specifically, how does the work grow as the data gets bigger?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 15, 10, 5, 20, 25]
})

grouped = df.groupby('Category')
filtered = grouped.filter(lambda x: x['Value'].mean() > 12)
```
This code groups data by 'Category' and keeps only groups where the average 'Value' is more than 12.
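To make the behaviour concrete, here is the same snippet with the group means worked out by hand (a small illustrative run, not part of the original exercise):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 15, 10, 5, 20, 25]
})

# Group means: A -> 12.5, B -> 7.5, C -> 22.5.
# Only groups A and C pass the mean > 12 test, so B's rows are dropped.
filtered = df.groupby('Category').filter(lambda x: x['Value'].mean() > 12)
print(filtered)  # rows for categories A and C remain
```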
Identify the repeated operations: loops, recursion, and array traversals.
- Primary operation: `filter()` applies the lambda to each group, calculating the mean of 'Value' for that group.
- How many times: Once per group; computing a group's mean reads every row in that group once.
As the number of rows grows, the number of groups and their sizes affect the work done.
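Conceptually, the work above can be sketched as a per-group loop (an illustrative equivalent of what `filter()` does, not pandas' actual implementation):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 15, 10, 5, 20, 25]
})

# One pass per group; computing each group's mean reads every row in
# that group once, so the total work across all groups is one pass
# over all n rows.
kept = []
for _, group in df.groupby('Category'):
    if group['Value'].mean() > 12:  # O(size of group) to compute the mean
        kept.append(group)
result = pd.concat(kept)
```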
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~10: each row is read once when its group's mean is computed |
| 100 | ~100: still one pass over each row within its group |
| 1000 | ~1000: work scales with the total number of rows |
Pattern observation: The work grows roughly in direct proportion to the total number of rows.
Time Complexity: O(n)
This means the time needed grows linearly with the number of rows in the data.
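A rough empirical check of this linear growth can be sketched as follows (timings are machine-dependent, so treat the printed numbers as illustrative only; the group sizes and thresholds here are arbitrary choices for the experiment):

```python
import time
import numpy as np
import pandas as pd

times = []
for n in (1_000, 5_000, 25_000):
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'Category': rng.integers(0, n // 10, size=n),  # roughly 10 rows per group
        'Value': rng.random(size=n),
    })
    grouped = df.groupby('Category')
    start = time.perf_counter()
    grouped.filter(lambda x: x['Value'].mean() > 0.5)
    elapsed = time.perf_counter() - start
    times.append(elapsed)
    print(f"n={n:>6}: {elapsed:.4f}s")
```

With many small groups, the Python-level lambda call per group dominates, but the total still grows roughly in proportion to n.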
[X] Wrong: "Filtering groups with filter() is constant time regardless of data size."
[OK] Correct: The function runs on each group and each row inside, so more data means more work.
Knowing how group filtering scales helps you explain your choices clearly and shows you understand how data size affects performance.
"What if the filtering function was more complex, like calculating median instead of mean? How would the time complexity change?"
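As a starting point for exploring that question, here is the median variant (a sketch; the complexity notes in the comments reflect common implementations, not a guarantee about pandas internals):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 15, 10, 5, 20, 25]
})

# A median typically needs sorting (O(k log k) per group of size k) or a
# selection algorithm (O(k) on average), so the per-group cost can grow
# faster than the single pass a mean requires.
filtered_median = df.groupby('Category').filter(lambda x: x['Value'].median() > 12)
# For this data the group medians (A: 12.5, B: 7.5, C: 22.5) happen to
# give the same result as the mean-based filter.
```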