Percentiles and quantiles in NumPy/SciPy - Time & Space Complexity
We want to know how the time to calculate percentiles or quantiles changes as the data size grows.
How does the work increase when we have more numbers to analyze?
Analyze the time complexity of the following code snippet.
```python
import numpy as np

data = np.random.rand(1000)
percentile_90 = np.percentile(data, 90)
quantile_25 = np.quantile(data, 0.25)
```
This code generates 1000 random numbers and finds the 90th percentile and the 0.25 quantile (equivalent to the 25th percentile).
Identify the loops, recursion, or array traversals that repeat.
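To see why sorting is the dominant cost, here is a minimal sketch of computing a percentile by hand. The function name `percentile_by_sorting` is illustrative; the linear interpolation step mirrors `np.percentile`'s default `"linear"` method, so the two results should agree.

```python
import numpy as np

def percentile_by_sorting(data, q):
    # Sorting is the expensive step: O(n log n) comparisons on average.
    s = np.sort(data)
    # Fractional index of the q-th percentile in the sorted array,
    # then linear interpolation between the two neighboring values.
    idx = (q / 100) * (len(s) - 1)
    lo = int(np.floor(idx))
    hi = int(np.ceil(idx))
    frac = idx - lo
    return s[lo] * (1 - frac) + s[hi] * frac

rng = np.random.default_rng(0)
data = rng.random(1000)
print(np.isclose(percentile_by_sorting(data, 90), np.percentile(data, 90)))  # True
```

Everything besides the sort (indexing and interpolation) is constant-time work, which is why the sort dictates the overall complexity.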
- Primary operation: Sorting the array to find the percentile or quantile.
- How many times: The sorting happens once per percentile or quantile calculation.
Sorting takes more time as the list gets longer, but not in direct proportion: doubling the data slightly more than doubles the work.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 operations |
| 100 | About 700 operations |
| 1000 | About 10,000 operations |
Pattern observation: The operations grow faster than the input size, but far more slowly than the square of the input size.
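The figures in the table come from the n · log₂(n) estimate for comparison-based sorting, which you can verify directly (the table rounds these to friendlier numbers):

```python
import math

# Approximate operation counts for comparison-based sorting: n * log2(n).
for n in (10, 100, 1000):
    print(n, round(n * math.log2(n)))
# 10 -> 33, 100 -> 664, 1000 -> 9966
```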
Time Complexity: O(n log n)
This means the time to find percentiles grows a bit faster than the number of data points because of sorting.
[X] Wrong: "Finding a percentile is just picking one number, so it should be very fast regardless of data size."
[OK] Correct: To find the correct percentile, the data usually needs to be sorted first, which takes more time as data grows.
Understanding how sorting affects percentile calculations helps you explain performance in data analysis tasks clearly and confidently.
What if we used an approximate method that does not sort the entire data? How would the time complexity change?
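One way to avoid a full sort is selection: `np.partition` places the k-th smallest element in its sorted position in average O(n) time without ordering the rest of the array. The sketch below uses this to recover the 0.25 quantile exactly; the variable names are illustrative, and whether this helps in practice depends on how many quantiles you need from the same data.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)

# Fractional index of the 0.25 quantile in sorted order.
idx = 0.25 * (len(data) - 1)
lo = int(idx)
frac = idx - lo

# Partition so that positions lo and lo+1 hold their sorted values;
# the rest of the array stays unsorted. Average cost: O(n).
part = np.partition(data, [lo, lo + 1])
q25 = part[lo] * (1 - frac) + part[lo + 1] * frac

print(np.isclose(q25, np.quantile(data, 0.25)))  # True
```

With selection the time complexity drops from O(n log n) to O(n) on average, at the cost of re-partitioning for each new quantile rather than reusing one sorted copy.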