ANOVA in Data Analysis (Python) - Time & Space Complexity
We want to understand how the time needed to run ANOVA changes when we have more groups or more data points.
How does the work grow as the data size grows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd
from scipy import stats

# Three groups of 50 values each
data = pd.DataFrame({
    'group': ['A'] * 50 + ['B'] * 50 + ['C'] * 50,
    'value': list(range(50)) + list(range(50, 100)) + list(range(100, 150))
})

# One-way ANOVA comparing the three group means
f_val, p_val = stats.f_oneway(
    data[data['group'] == 'A']['value'],
    data[data['group'] == 'B']['value'],
    data[data['group'] == 'C']['value']
)
```
This code runs ANOVA to compare means of three groups, each with 50 data points.
Identify the loops, recursion, and array traversals that do the repeated work.
- Primary operation: Calculating group means and variances by scanning each data point.
- How many times: Each data point is visited once to compute sums and variances.
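To see why a single scan per group is enough, here is a hedged sketch (not SciPy's actual internals) that computes the one-way F-statistic from between-group and within-group sums of squares, each obtained with linear passes over the data, and checks it against `stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Illustrative data: three groups of 50 values (any shapes would do)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, size=50) for m in (0.0, 0.5, 1.0)]

n_total = sum(len(g) for g in groups)
k = len(groups)
grand_mean = sum(g.sum() for g in groups) / n_total  # one pass per group

# Between-group sum of squares: one term per group mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: one more linear pass per group
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group variance) / (within-group variance)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, _ = stats.f_oneway(*groups)
```

Every term above comes from summing over the data a constant number of times, which is why no pairwise comparison of points is ever needed.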
As the number of data points grows, the time to compute group statistics grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations to sum and calculate variance |
| 100 | About 100 operations |
| 1000 | About 1000 operations |
Pattern observation: Doubling data roughly doubles the work needed.
Time Complexity: O(n)
This means the time to run ANOVA grows linearly with the total number of data points. Space is also O(n): the DataFrame and the per-group slices dominate memory.
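You can check the linear growth empirically with a rough timing sketch (results are machine-dependent; the helper name `time_anova` is our own):

```python
import time
import numpy as np
from scipy import stats

def time_anova(n_per_group, reps=5):
    """Average wall-clock time of f_oneway on 3 groups of the given size."""
    rng = np.random.default_rng(42)
    groups = [rng.normal(size=n_per_group) for _ in range(3)]
    start = time.perf_counter()
    for _ in range(reps):
        stats.f_oneway(*groups)
    return (time.perf_counter() - start) / reps

t_small = time_anova(10_000)
t_large = time_anova(100_000)
# For a linear algorithm, 10x the data should cost on the order of 10x
# the time, nowhere near the 100x that quadratic growth would predict.
print(f"10k/group: {t_small:.6f}s, 100k/group: {t_large:.6f}s")
```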
[X] Wrong: "ANOVA time grows with the square of data size because it compares all pairs of points."
[OK] Correct: ANOVA calculates group statistics by scanning data once, not by comparing every pair.
Understanding how ANOVA scales helps you explain performance when working with bigger datasets in real projects.
"What if we increased the number of groups instead of data points? How would the time complexity change?"