t-test with scipy.stats in Python Data Analysis - Time & Space Complexity
We want to understand how the time needed to run a t-test changes as the data size grows.
How does the number of data points affect the work done by the t-test function?
Analyze the time complexity of the following code snippet.
```python
from scipy import stats

data1 = [1, 2, 3, 4, 5]
data2 = [2, 3, 4, 5, 6]

# Independent two-sample t-test (equal variances assumed by default)
result = stats.ttest_ind(data1, data2)
print(result)
```
This code runs an independent t-test to compare two lists of numbers.
Identify the loops, recursion, or array traversals that repeat work.
- Primary operation: The function calculates means and variances by going through each list of numbers.
- How many times: Each list is scanned once to compute summary statistics.
As the number of data points increases, the time to compute the t-test grows roughly in direct proportion.
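To see why summary statistics are all the work involved, here is a sketch that recomputes the default (equal-variance) t-statistic by hand using only one mean and one variance per list. The formula is the standard pooled-variance Student's t; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

data1 = [1, 2, 3, 4, 5]
data2 = [2, 3, 4, 5, 6]

n1, n2 = len(data1), len(data2)
m1, m2 = mean(data1), mean(data2)          # one scan per list
v1, v2 = variance(data1), variance(data2)  # sample variances

# Pooled variance for the equal-variance (default) two-sample t-test
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_stat = (m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))
print(t_stat)  # -1.0, matching stats.ttest_ind(data1, data2).statistic
```

Everything after the scans is a handful of arithmetic operations on scalars, which is why the total cost is dominated by traversing the two lists.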
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 (two lists scanned) |
| 100 | About 200 |
| 1000 | About 2000 |
Pattern observation: Doubling the data roughly doubles the work done.
Time Complexity: O(n)
This means the time to run the t-test grows linearly with the number of data points.
Space Complexity: O(n) - the input lists are converted to arrays, but beyond that only a constant number of summary values (counts, means, variances) are stored.
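The "about 2n operations" pattern in the table assumes each list can be scanned once. One way to see that a single traversal really does suffice for both mean and variance is Welford's one-pass algorithm, sketched below (scipy's internals may compute these differently; this is illustrative).

```python
# Welford's one-pass mean/variance: each element is touched exactly once,
# so two lists of size n cost about 2n element accesses in total.
def mean_var_one_pass(data):
    count, m, m2 = 0, 0.0, 0.0
    for x in data:
        count += 1
        delta = x - m
        m += delta / count        # running mean
        m2 += delta * (x - m)     # running sum of squared deviations
    return m, m2 / (count - 1)    # sample mean, sample variance

m, v = mean_var_one_pass([1, 2, 3, 4, 5])
print(m, v)  # 3.0 2.5
```

Even a naive two-pass version (one scan for the mean, one for the variance) only changes the constant factor, not the O(n) growth.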
[X] Wrong: "The t-test time grows with the square of the data size because it compares every pair of points."
[OK] Correct: The t-test only needs summary statistics like means and variances, so it scans each list once, not every pair.
Understanding how statistical tests scale helps you write efficient data analysis code and explain your choices clearly.
"What if we used a bootstrap method with many resamples instead of a t-test? How would the time complexity change?"