Why Statistics Validates Hypotheses in Python Data Analysis - Performance Analysis
When we use statistics to check hypotheses, we run calculations on data to see if our ideas hold true.
We want to know how the time to do these checks grows as we get more data.
Analyze the time complexity of the following code snippet.
```python
import numpy as np
from scipy import stats

def test_hypothesis(data):
    # One pass over the data to compute the sample mean
    # (kept here for illustration; the t-test below recomputes it internally)
    mean_val = np.mean(data)
    # One-sample t-test against a population mean of 0: another pass over the data
    t_stat, p_val = stats.ttest_1samp(data, popmean=0)
    return t_stat, p_val

sample_data = np.random.randn(1000)
test_hypothesis(sample_data)
```
This code calculates the average of the data and runs a one-sample t-test to check whether the data mean differs from zero.
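To see why both steps are single passes, here is a minimal sketch (not part of the original code) that recomputes the t-statistic by hand; each NumPy call below makes one pass over the array:

```python
import numpy as np
from scipy import stats

data = np.random.randn(1000)

# Each call below is a single O(n) pass over the array
n = len(data)
mean_val = np.mean(data)
std_val = np.std(data, ddof=1)  # sample standard deviation

# One-sample t-statistic against a population mean of 0
t_manual = (mean_val - 0) / (std_val / np.sqrt(n))

t_scipy, p_val = stats.ttest_1samp(data, popmean=0)
print(t_manual, t_scipy)  # should agree up to floating-point rounding
```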
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Calculating the mean and running the t-test each scan the data array once (made explicit in the loop sketch below).
- How many times: Each operation goes through all n data points once.
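To make that traversal visible, here is a minimal sketch of the mean as an explicit Python loop (NumPy performs the same pass in optimized C, but the number of element visits is identical):

```python
def mean_by_hand(data):
    # One visit per element: n additions followed by one division -> O(n)
    total = 0.0
    for x in data:
        total += x
    return total / len(data)
```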
As the data size grows, the time to calculate the mean and run the test grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 operations (two passes over 10 items) |
| 100 | About 200 operations |
| 1000 | About 2000 operations |
Pattern observation: Doubling data roughly doubles the work needed.
Time Complexity: O(n)
This means the time to validate a hypothesis grows linearly with the amount of data.
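A quick, hedged way to check this empirically is to time test_hypothesis on increasing input sizes; the exact numbers depend on your machine, but the elapsed time should grow roughly tenfold with each tenfold increase in n:

```python
import time
import numpy as np
from scipy import stats

def test_hypothesis(data):
    mean_val = np.mean(data)
    t_stat, p_val = stats.ttest_1samp(data, popmean=0)
    return t_stat, p_val

for n in (10_000, 100_000, 1_000_000):
    data = np.random.randn(n)
    start = time.perf_counter()
    test_hypothesis(data)
    elapsed = time.perf_counter() - start
    print(f"n={n:>9,}  time={elapsed:.6f} s")
```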
[X] Wrong: "Running a statistical test takes the same time no matter how much data we have."
[OK] Correct: The test must look at each data point, so more data means more work and more time.
Understanding how time grows with data size helps you explain the cost of statistical checks clearly and confidently.
"What if we used a bootstrap method with 1000 resamples? How would the time complexity change?"