Time & Space Complexity of describe() for Statistics in Python Data Analysis
We want to understand how the time to run describe() changes as the data size grows.
How does the work inside describe() scale with more data?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])
summary = data.describe()
print(summary)
```
This code calculates basic statistics like count, mean, min, max, and quartiles for a data series.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: The function scans through all data points to compute statistics.
- How many times: Each statistic requires at least one full pass over the data. Several of them (count, mean, min, max) can share a single pass, while the quartiles need the data to be ordered, which costs some extra work.
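To make the "one pass can serve several statistics" idea concrete, here is a minimal sketch in plain Python. This is an illustration, not pandas' actual implementation:

```python
def single_pass_stats(values):
    """Compute count, mean, min, and max in one traversal of the data."""
    count = 0
    total = 0.0
    lo = float("inf")
    hi = float("-inf")
    for v in values:  # a single O(n) pass covers four statistics
        count += 1
        total += v
        lo = min(lo, v)
        hi = max(hi, v)
    return {"count": count, "mean": total / count, "min": lo, "max": hi}

print(single_pass_stats([1, 2, 3, 4, 5]))
# Quartiles are the exception: they need the data ordered, so they
# cannot piggyback on this same pass.
```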
As the number of data points grows, the time to compute these statistics grows roughly linearly.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations per statistic |
| 100 | About 100 operations per statistic |
| 1000 | About 1000 operations per statistic |
Pattern observation: Doubling the data roughly doubles the work needed.
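The pattern in the table can be verified with a tiny operation counter. This is a hedged sketch using a hypothetical helper, not anything from pandas itself:

```python
def count_operations(values):
    """Count how many elements one summing pass touches."""
    ops = 0
    total = 0.0
    for v in values:  # one operation per data point
        total += v
        ops += 1
    return ops

# One pass touches exactly n elements, so doubling n doubles the work:
for n in (10, 100, 1000):
    print(n, count_operations(range(n)))
```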
Time Complexity: O(n)
This means the time to run describe() grows linearly with the number of data points.
[X] Wrong: "describe() runs instantly no matter how big the data is."
[OK] Correct: The function must look at every data point to calculate statistics, so more data means more work and more time.
Understanding how basic statistics scale helps you explain data processing speed clearly and confidently in real projects.
"What if describe() was called on a DataFrame with many columns instead of a single Series? How would the time complexity change?"