Why statistics quantifies uncertainty in SciPy - Performance Analysis
When we use statistics to quantify uncertainty, we typically run calculations over a sample of data. We want to know how the time for these calculations grows as the sample size grows.
Let's analyze the time complexity of the following code snippet.
```python
import numpy as np
from scipy import stats

n = 1000  # sample size
data = np.random.normal(loc=0, scale=1, size=n)
mean = np.mean(data)
# 95% confidence interval for the mean. The keyword is `confidence`;
# the older `alpha` keyword was deprecated in SciPy 1.9 and removed in 1.11.
# ddof=1 gives the sample standard deviation (np.std defaults to ddof=0).
conf_int = stats.norm.interval(confidence=0.95, loc=mean,
                               scale=np.std(data, ddof=1) / np.sqrt(n))
```
This code generates data, calculates the mean, and finds a 95% confidence interval to quantify uncertainty.
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Calculating the mean and standard deviation by scanning all data points.
- How many times: Each data point is visited once for the mean and once again for the standard deviation (two passes in total).
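To make the two passes concrete, here is an illustrative sketch that computes the mean and standard deviation with explicit loops and counts how many times a data point is touched. (NumPy does this internally in optimized C; the counter variable `visits` is added here purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = rng.normal(loc=0, scale=1, size=n)

visits = 0

# Pass 1: mean requires touching every data point once.
total = 0.0
for x in data:
    total += x
    visits += 1
mean = total / n

# Pass 2: standard deviation touches every point once more,
# because it needs the mean computed in pass 1.
sq_dev = 0.0
for x in data:
    sq_dev += (x - mean) ** 2
    visits += 1
std = (sq_dev / n) ** 0.5

print(visits)  # 2 * n: each point is seen once per statistic
```

With n = 1000 this prints 2000 visits, matching the "two passes" row of the table below.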
As the number of data points increases, the time to calculate mean and standard deviation grows proportionally.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 (two passes over 10 points) |
| 100 | About 200 |
| 1000 | About 2000 |
Pattern observation: The operations grow roughly in direct proportion to the input size.
Time Complexity: O(n)
This means the time to quantify uncertainty grows linearly as the data size grows.
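A quick, machine-dependent way to check this linear growth empirically is to time the mean and standard deviation for increasing n. The exact numbers will vary from machine to machine; the point is only that a 10x larger input takes roughly 10x longer.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
times = []
for n in (10_000, 100_000, 1_000_000):
    data = rng.normal(size=n)
    start = time.perf_counter()
    np.mean(data)  # one pass over the data
    np.std(data)   # another pass over the data
    times.append(time.perf_counter() - start)
    print(f"n={n:>9,}  time={times[-1]:.6f}s")
```

If the timings grow roughly tenfold between rows, that is the O(n) pattern in practice.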
[X] Wrong: "Calculating uncertainty takes the same time no matter how much data there is."
[OK] Correct: Because mean and standard deviation require looking at every data point, more data means more work.
Understanding how time grows with data size helps you explain the cost of statistical calculations clearly and confidently.
"What if we used a streaming method that updates mean and variance without storing all data? How would the time complexity change?"