Outlier detection with IQR in Pandas - Time & Space Complexity
We want to understand how the time needed to find outliers using IQR changes as the data grows.
How does the work increase when we have more data points?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Assume df is a DataFrame with a numeric column 'values'
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['values'] < (Q1 - 1.5 * IQR)) | (df['values'] > (Q3 + 1.5 * IQR))]
```
This code calculates the IQR and finds rows where values are outliers based on that range.
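To make the snippet concrete, here is a self-contained version run on a small toy DataFrame (the data values are an illustrative assumption, not part of the lesson):

```python
import pandas as pd

# Toy data with one obvious outlier (100) among values near 10-14.
df = pd.DataFrame({'values': [10, 12, 11, 13, 12, 11, 14, 100]})

Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1

# Any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier.
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = df[(df['values'] < lower) | (df['values'] > upper)]
print(outliers)  # the single row containing 100
```

Here Q1 = 11.0 and Q3 = 13.25, so the fences are roughly [7.6, 16.6] and only the value 100 falls outside them.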
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Pandas `quantile` internally sorts the 'values' column (O(n log n)) to compute order statistics, plus an O(n) boolean filtering pass.
- How many times: Two quantile calls (each O(n log n)) and one filtering pass (O(n)); the sorting dominates.
As the number of data points grows, the time grows as O(n log n) due to sorting in quantiles.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 70 (2 sorts ~60 + 10 filter) |
| 100 | About 1,400 |
| 1000 | About 21,000 |
Pattern observation: Doubling n more than doubles the work due to the log n factor.
Time Complexity: O(n log n)
Dominated by sorting within pandas quantile calls.
[X] Wrong: "It's O(n) because we just scan the data a few times."
[OK] Correct: quantile requires sorting to find precise percentiles, costing O(n log n).
Understanding how data size affects outlier detection helps you explain your approach clearly and shows you think about efficiency in real data tasks.
"Pandas quantile uses sorting (O(n log n)). What if we used a linear-time selection algorithm (like quickselect) for exact quantiles? How would complexity change?"