Goodness of fit evaluation in SciPy - Time & Space Complexity
We want to know how the time needed to check how well a model fits the data changes as the dataset grows. In other words: how does the cost of evaluating goodness of fit scale with the number of data points?
Analyze the time complexity of the following code snippet.
```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([20, 30, 50])
expected = np.array([25, 25, 50])

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-square statistic: {chi2_stat}, p-value: {p_value}")
```
This code calculates the chi-square test statistic and p-value to see how well observed data matches expected data.
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Element-wise subtraction, squaring, and division for each data point.
- How many times: Once for each data point in the observed and expected arrays.
Each data point requires a fixed number of calculations. More data points mean more calculations.
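This per-point work can be sketched directly, assuming the standard Pearson chi-square formula, sum((observed - expected)^2 / expected); each vectorized step touches every element exactly once, and the result matches what `scipy.stats.chisquare` reports:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([20, 30, 50])
expected = np.array([25, 25, 50])

# One subtraction, one squaring, and one division per data point,
# then a single sum over n values -- O(n) work in total.
manual_stat = np.sum((observed - expected) ** 2 / expected)

scipy_stat, _ = chisquare(f_obs=observed, f_exp=expected)
print(manual_stat, scipy_stat)  # → 2.0 2.0
```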
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 sets of calculations |
| 100 | About 100 sets of calculations |
| 1000 | About 1000 sets of calculations |
Pattern observation: The number of operations grows directly with the number of data points.
Time Complexity: O(n)
This means the time to evaluate goodness of fit grows linearly with the number of data points. The space complexity is also O(n): the observed and expected arrays, plus the intermediate element-wise results, each hold n values.
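A rough timing sketch can make the linear trend visible. The exact times are machine-dependent (this is an illustration, not a benchmark), but runtime should grow roughly in proportion to n; the uniform `expected` array is chosen so that both arrays have the same total, which `chisquare` requires:

```python
import time

import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

for n in (1_000, 10_000, 100_000):
    observed = rng.integers(1, 100, size=n).astype(float)
    # Uniform expectation with the same total as observed,
    # so the chisquare sum check passes.
    expected = np.full(n, observed.sum() / n)

    t0 = time.perf_counter()
    stat, p = chisquare(f_obs=observed, f_exp=expected)
    elapsed = time.perf_counter() - t0
    print(f"n={n:>7}: statistic={stat:.1f}, time={elapsed:.6f}s")
```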
[X] Wrong: "The test runs in constant time no matter how much data there is."
[OK] Correct: Each data point must be checked, so more data means more work and more time.
Understanding how time grows with data size helps you explain your code choices clearly and shows you think about efficiency.
"What if we used a more complex goodness of fit test that compares pairs of data points? How would the time complexity change?"