Chi-squared test in SciPy - Time & Space Complexity
We want to understand how the time needed to run a chi-squared test changes as the data size grows.
Specifically, how does the test's execution time grow when we have more categories or observations?
Analyze the time complexity of the following code snippet.
```python
import numpy as np
from scipy.stats import chi2_contingency

# Create a contingency table with n rows and m columns
n, m = 100, 5
data = np.random.randint(1, 10, size=(n, m))

# Perform chi-squared test
chi2, p, dof, expected = chi2_contingency(data)
```
This code creates a table of counts and runs the chi-squared test to check if rows and columns are independent.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Calculating expected frequencies and summing over all cells in the contingency table.
- How many times: once per cell in the table, i.e., n * m times.
As the number of rows or columns increases, the number of cells grows, so the work grows too.
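To see where the n * m work comes from, here is a minimal sketch of the expected-frequency step, assuming the standard formula E[i, j] = row_total[i] * col_total[j] / grand_total (the helper name `expected_counts` is ours, not SciPy's):

```python
import numpy as np

def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    table = np.asarray(table, dtype=float)
    row_totals = table.sum(axis=1)   # n sums, each over m cells
    col_totals = table.sum(axis=0)   # m sums, each over n cells
    grand_total = table.sum()
    # The outer product fills all n * m cells -- this is the O(n * m) work
    return np.outer(row_totals, col_totals) / grand_total

table = np.array([[10, 20], [30, 40]])
print(expected_counts(table))
```

Every cell of the result must be written once, so the work scales with the cell count no matter how the totals are computed.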
| Input Size (n x m) | Approx. Operations |
|---|---|
| 10 x 5 | 50 |
| 100 x 5 | 500 |
| 1000 x 5 | 5000 |
Pattern observation: The operations grow roughly in direct proportion to the number of cells in the table.
Time Complexity: O(n * m)
This means the time needed grows linearly with the total number of cells in the contingency table.
[X] Wrong: "The chi-squared test runs in constant time no matter the data size."
[OK] Correct: The test must look at every cell to calculate expected counts and contributions, so more data means more work.
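One way to convince yourself of the per-cell work: recompute the statistic by visiting every cell and compare with SciPy. Passing `correction=False` avoids the Yates adjustment that `chi2_contingency` applies to 2x2 tables, so the plain Pearson sum matches:

```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[12, 18, 30],
                [20, 25, 15]], dtype=float)
chi2, p, dof, expected = chi2_contingency(obs, correction=False)

# Manual sum over all n * m cells: sum((O - E)^2 / E)
manual = ((obs - expected) ** 2 / expected).sum()
print(np.isclose(chi2, manual))  # -> True: the cell-by-cell sum matches SciPy
```

Every cell contributes one term to the sum, so there is no way to get the answer without touching all n * m cells.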
Understanding how the chi-squared test scales helps you explain performance when working with larger datasets in real projects.
"What if we changed the contingency table to be square with n rows and n columns? How would the time complexity change?"
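One way to explore this question: with a square table, m equals n, so the cell count (and hence the O(n * m) bound) becomes n * n, i.e., O(n^2).

```python
import numpy as np

# Sketch: for a square n x n table, the number of cells grows as n^2,
# so doubling n quadruples the work.
for n in (10, 100):
    data = np.random.randint(1, 10, size=(n, n))
    print(n, "x", n, "->", data.size, "cells")
```

Going from a 10 x 10 to a 100 x 100 table multiplies the cell count by 100, not by 10.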