P-Values and Significance in Data Analysis (Python): Time & Space Complexity
We want to understand how the time to calculate p-values changes as the amount of data grows.
How does the work needed grow when we have more data points?
Analyze the time complexity of the following code snippet.
```python
import numpy as np
from scipy import stats

def calculate_p_value(data1, data2):
    # data1 and data2 are lists or NumPy arrays of numbers
    t_stat, p_val = stats.ttest_ind(data1, data2)
    return p_val
```
This code calculates the p-value from two groups of data using a t-test.
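A quick usage sketch (the sample groups and their means are made up for illustration):

```python
import numpy as np
from scipy import stats

def calculate_p_value(data1, data2):
    # data1 and data2 are lists or NumPy arrays of numbers
    t_stat, p_val = stats.ttest_ind(data1, data2)
    return p_val

# Two hypothetical groups drawn from normal distributions with different means
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)
group_b = rng.normal(loc=110, scale=15, size=50)

p = calculate_p_value(group_a, group_b)
print(f"p-value: {p:.4f}")  # always a value between 0 and 1
```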
Identify the repeated operations: loops, recursion, or array traversals.
- Primary operation: The t-test function internally processes each data point in both groups.
- How many times: Each data point in both data1 and data2 is visited once to compute means and variances.
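To see where the linear cost comes from, here is a hand-rolled version of the pooled two-sample t statistic (a sketch of the textbook formula, not SciPy's actual implementation). Each group is traversed a constant number of times, so the work is proportional to the total number of points:

```python
import math

def manual_t_stat(data1, data2):
    n1, n2 = len(data1), len(data2)
    # One pass over each group to compute the means: O(n)
    mean1 = sum(data1) / n1
    mean2 = sum(data2) / n2
    # One more pass per group for the sample variances: O(n)
    var1 = sum((x - mean1) ** 2 for x in data1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in data2) / (n2 - 1)
    # Pooled standard error (the equal-variance t-test, SciPy's default)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se
```

A constant number of O(n) passes is still O(n) overall, which is exactly the pattern the table below shows.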
As the number of data points increases, the time to compute the p-value grows roughly in direct proportion.
| Input Size (n per group) | Approx. Operations |
|---|---|
| 10 | About 20 (10 in each group) |
| 100 | About 200 |
| 1000 | About 2000 |
Pattern observation: Doubling the data roughly doubles the work needed.
Time Complexity: O(n)
This means the time to calculate the p-value grows linearly with the total number of data points. Space behaves similarly or better: a running-sums implementation needs only O(1) extra memory, while NumPy-based implementations may allocate intermediate arrays, for at most O(n) extra space.
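You can check the linear growth empirically. A rough benchmark sketch (absolute times depend entirely on your machine; only the roughly tenfold growth between rows matters):

```python
import time
import numpy as np
from scipy import stats

def time_ttest(n, seed=0):
    """Return the seconds one ttest_ind call takes on two size-n samples."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    start = time.perf_counter()
    stats.ttest_ind(a, b)
    return time.perf_counter() - start

for n in (10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: {time_ttest(n) * 1000:.2f} ms")
```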
[X] Wrong: "Calculating a p-value takes the same time no matter how much data there is."
[OK] Correct: The calculation must look at each data point to find averages and variances, so more data means more work.
Understanding how data size affects calculation time helps you explain your approach clearly and shows you know what happens behind the scenes.
"What if we used a bootstrap method with 1000 resamples to estimate the p-value? How would the time complexity change?"