Chi-squared Test in Python Data Analysis - Time & Space Complexity
We want to understand how the time it takes to run a Chi-squared test changes as the size of the data grows.
Specifically, how does the number of operations increase when we have more categories or observations?
Analyze the time complexity of the following code snippet.
```python
import numpy as np
from scipy.stats import chi2_contingency

# Create a contingency table with n rows and m columns
n, m = 100, 5
data = np.random.randint(1, 10, size=(n, m))

# Perform the Chi-squared test
chi2, p, dof, expected = chi2_contingency(data)
```
This code creates a table of counts and runs the Chi-squared test to check if rows and columns are independent.
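To see where the per-cell work comes from, here is a minimal sketch of the underlying computation (the small `observed` table is hypothetical): the expected counts come from the row and column marginals, and the test statistic sums one squared difference per cell.

```python
import numpy as np

# A small hypothetical contingency table (2 rows x 3 columns)
observed = np.array([[10, 20, 30],
                     [ 6,  9, 17]])

# Expected counts under independence: one value per cell (n x m work)
row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
total = observed.sum()
expected = row_sums * col_sums / total

# The statistic sums one squared difference per cell: again n x m work
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_stat, 4))
```

Both steps touch every cell exactly once, which is the source of the O(n x m) behavior discussed below.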
Identify the repeated operations: loops, recursion, and array traversals.
- Primary operation: Calculating expected counts and summing squared differences for each cell in the table.
- How many times: Once for each cell in the contingency table, so n x m times.
As the number of rows or columns increases, the number of cells grows, so the work grows with the total number of cells.
| Input Size (n x m) | Approx. Operations |
|---|---|
| 10 x 5 = 50 | About 50 operations |
| 100 x 5 = 500 | About 500 operations |
| 1000 x 5 = 5000 | About 5000 operations |
Pattern observation: The operations grow roughly in direct proportion to the number of cells in the table.
Time Complexity: O(n x m)
This means the time to run the test grows linearly with the number of rows times columns in the data table. Space complexity is likewise O(n x m), since `chi2_contingency` builds an expected-counts array with the same shape as the input table.
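You can check the linear pattern empirically with a quick timing loop (a rough sketch; absolute timings depend on your machine and SciPy version, but the growth trend should be roughly proportional to the row count):

```python
import time
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
for n in (1_000, 10_000, 100_000):
    # Random counts in a table with n rows and 5 columns
    data = rng.integers(1, 10, size=(n, 5))
    start = time.perf_counter()
    chi2_contingency(data)
    elapsed = time.perf_counter() - start
    print(f"{n:>7} rows x 5 cols: {elapsed:.4f} s")
```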
[X] Wrong: "The Chi-squared test runs in constant time no matter the data size."
[OK] Correct: The test must look at every cell to calculate expected counts and differences, so more data means more work.
Understanding how the Chi-squared test scales helps you explain performance when working with larger datasets in real projects.
"What if we changed the data to a 3D contingency table? How would the time complexity change?"