Why statistics with NumPy matters - Performance Analysis
We want to know how long it takes to calculate statistics using NumPy as the data size grows.
How does the time needed change when we have more numbers to analyze?
Analyze the time complexity of the following code snippet.
```python
import numpy as np

data = np.random.rand(1_000_000)   # one million random floats in [0, 1)
mean_value = np.mean(data)
std_dev = np.std(data)
median_value = np.median(data)
```
This code creates a large array of numbers and calculates the mean, standard deviation, and median.
Identify the loops, recursion, or array traversals that repeat:
- Primary operation: NumPy traverses the entire array to calculate each statistic.
- How many times: each statistic scans all n elements once.
As the array grows, the time to calculate each statistic grows roughly in direct proportion to its length.
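To make the single scan concrete, here is a minimal sketch in plain Python (not how NumPy is implemented internally, which uses compiled loops) that computes the mean and standard deviation in one pass over the data:

```python
import math

def one_pass_stats(values):
    """Compute mean and population standard deviation in one scan: O(n)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:          # one traversal of all n elements
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    # population variance: E[x^2] - (E[x])^2
    variance = total_sq / n - mean * mean
    return mean, math.sqrt(variance)

mean, std = one_pass_stats([1.0, 2.0, 3.0, 4.0])
# mean is 2.5; std matches np.std of the same data
```

The loop body runs exactly n times, which is why doubling the input doubles the work.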
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations per statistic |
| 100 | About 100 operations per statistic |
| 1000 | About 1000 operations per statistic |
Pattern observation: Doubling the data roughly doubles the work needed.
Time Complexity: O(n)
This means the time to calculate statistics grows linearly with the number of data points.
[X] Wrong: "Calculating the median is faster than the mean because it's just one value."
[OK] Correct: Finding the median requires looking at all numbers to sort or select the middle, so it still takes time proportional to the data size.
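To illustrate the correct statement: even without a full sort, selecting the middle element still has to examine every value. A small sketch for an odd-length array using `np.partition` (NumPy's selection routine, linear time on average):

```python
import numpy as np

def median_by_selection(arr):
    """Median of an odd-length array via partial selection, O(n) on average."""
    k = len(arr) // 2
    # np.partition places the k-th smallest element at index k;
    # it must still look at every element to do so.
    return np.partition(arr, k)[k]

vals = np.array([7.0, 1.0, 5.0, 3.0, 9.0])
m = median_by_selection(vals)   # middle of sorted [1, 3, 5, 7, 9], i.e. 5.0
```

This is faster in practice than a full O(n log n) sort, but it is still proportional to the data size, never O(1).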
Understanding how statistical calculations scale helps you explain your code's efficiency clearly and shows you can reason about performance on large datasets.
"What if we used a streaming method that updates the mean without looking at all data again? How would the time complexity change?"