Confidence intervals on parameters in SciPy - Time & Space Complexity
We want to understand how the time needed to compute a confidence interval changes as the dataset grows: how does the work scale with the number of data points?
Analyze the time complexity of the following code snippet:

```python
import numpy as np
from scipy import stats

data = np.random.normal(loc=0, scale=1, size=1000)

mean = np.mean(data)   # one pass over all n data points: O(n)
sem = stats.sem(data)  # another pass for the standard error: O(n)

confidence = 0.95
# The t critical value depends only on the confidence level and the
# degrees of freedom, not on the data values themselves: O(1)
h = sem * stats.t.ppf((1 + confidence) / 2., len(data) - 1)
interval = (mean - h, mean + h)
```
This code calculates a 95% confidence interval for the mean of a dataset.
Identify the loops, recursion, and repeated array traversals.
- Primary operation: Calculating the mean and standard error of the data, which involves going through all data points.
- How many times: Each data point is visited once when computing the mean and once when computing the standard error.
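The two passes can be made explicit with a minimal pure-Python sketch that counts element visits (the function name `mean_and_sem_counted` and the `ops` counter are illustrative, not part of SciPy or NumPy):

```python
def mean_and_sem_counted(data):
    """Compute the mean and standard error of the mean, counting
    how many times an element of `data` is visited."""
    n = len(data)
    ops = 0

    total = 0.0
    for x in data:           # pass 1: sum for the mean
        total += x
        ops += 1
    mean = total / n

    ss = 0.0
    for x in data:           # pass 2: squared deviations for the SEM
        ss += (x - mean) ** 2
        ops += 1
    sem = (ss / (n - 1)) ** 0.5 / n ** 0.5

    return mean, sem, ops

mean, sem, ops = mean_and_sem_counted([float(i) for i in range(1000)])
print(ops)  # 2000: each of the 1000 points is visited exactly twice
```

The counter comes out to exactly 2n, which is where the "about 2000 operations for n = 1000" figure in the table below comes from.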
As the number of data points grows, the time to calculate the mean and standard error grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 (mean + sem calculations) |
| 100 | About 200 |
| 1000 | About 2000 |
Pattern observation: Doubling the data roughly doubles the work needed.
Time Complexity: O(n)
This means the time to compute confidence intervals grows linearly with the number of data points.
[X] Wrong: "Calculating confidence intervals takes the same time no matter how much data there is."
[OK] Correct: The calculations must look at each data point to find the mean and error, so more data means more work.
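One way to see this concretely is to wrap the snippet in a small helper (the name `mean_confidence_interval` is our own, not a SciPy API) and observe that more data narrows the interval rather than making it free to compute:

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(data, confidence=0.95):
    """Two O(n) passes over the data (mean and SEM), plus an O(1)
    lookup of the t critical value."""
    data = np.asarray(data)
    mean = np.mean(data)
    sem = stats.sem(data)
    h = sem * stats.t.ppf((1 + confidence) / 2.0, len(data) - 1)
    return mean - h, mean + h

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=4000)

lo_small, hi_small = mean_confidence_interval(sample[:1000])
lo_big, hi_big = mean_confidence_interval(sample)
# The 4000-point interval is narrower (SEM shrinks like 1/sqrt(n)),
# but computing it required traversing four times as many points.
```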
Understanding how data size affects calculation time helps you explain your code's efficiency clearly and confidently.
"What if we used a more complex method that requires multiple passes over the data? How would the time complexity change?"