Correlation Analysis (Pearson, Spearman) in Python Data Analysis: Time and Space Complexity
We want to know how the time needed to calculate correlation changes as the data grows.
How does the work increase when we have more data points?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def compute_correlations(df, col1, col2):
    # Pearson: built from means, standard deviations, and covariance -- O(n)
    pearson_corr, _ = pearsonr(df[col1], df[col2])
    # Spearman: correlation of ranks; ranking requires a sort -- O(n log n)
    spearman_corr, _ = spearmanr(df[col1], df[col2])
    return pearson_corr, spearman_corr
```
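For context, here is a minimal usage sketch. The column values are made up for illustration: `y` grows monotonically but nonlinearly with `x`, so Spearman (rank-based) reports a perfect 1.0 while Pearson falls below 1.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Synthetic data: y = x**3 is monotonic in x but not linear.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [1, 8, 27, 64, 125, 216]})

pearson_corr, _ = pearsonr(df["x"], df["y"])
spearman_corr, _ = spearmanr(df["x"], df["y"])
print(pearson_corr, spearman_corr)
```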
This code calculates Pearson and Spearman correlation between two columns in a data table.
Identify the repeated work: loops, recursion, and array traversals.
- Pearson: Single passes over data to compute means, std devs, covariance: O(n).
- Spearman: Sorting to compute ranks: O(n log n), followed by linear correlation on ranks: O(n).
- Dominant operation: Sorting in Spearman.
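To make those passes visible, here is a from-scratch sketch with the same asymptotic shape. This is not how SciPy is implemented internally, and the `ranks` helper ignores ties for simplicity (SciPy average-ranks them):

```python
import math

def pearson_manual(xs, ys):
    # Single O(n) passes: means, then covariance and variances.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    # O(n log n): the sort dominates; assigning ranks afterwards is O(n).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman_manual(xs, ys):
    # Rank both columns (O(n log n)), then Pearson on the ranks (O(n)).
    return pearson_manual(ranks(xs), ranks(ys))
```

With no ties in the data, `spearman_manual` agrees with `scipy.stats.spearmanr`.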
As the number of data points n increases, the time grows as O(n log n), dominated by sorting for Spearman ranks.
| Input Size (n) | Approx. Operations (linear passes + ~2·n·log₂ n sorting) |
|---|---|
| 10 | ~100 |
| 100 | ~1,500 |
| 1,000 | ~20,000 |
Pattern observation: Grows faster than linear due to log n factor in sorting, but still very efficient.
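The table's rough counts come from a simple cost model; the constants here (two rank sorts, a few linear passes) are assumptions for illustration, not measured values:

```python
import math

def approx_ops(n):
    # Rough model: two O(n log n) rank sorts plus a handful of O(n) passes.
    return 2 * n * math.log2(n) + 4 * n

for n in (10, 100, 1000):
    print(n, round(approx_ops(n)))
```

Note the growth ratio: a 10× larger input costs about 13-14× more work, faster than linear but far below the 100× a quadratic algorithm would pay.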
Time Complexity: O(n log n), dominated by the sorting step required for Spearman ranking; Pearson alone is O(n).
Space Complexity: O(n), mainly for the rank arrays Spearman allocates.
[X] Wrong: "Correlation calculation time grows much faster than the data size because it compares every pair of points."
[OK] Correct: No pairwise comparisons (that would be O(n²)). Pearson uses O(n) summary statistics; Spearman uses O(n log n) sorting for ranks.
Understanding how correlation calculations scale helps you explain data analysis efficiency clearly and confidently.
"What if we wanted to compute correlations between every pair of columns in a dataset? How would the time complexity change?"
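One way to reason about the closing question: with m columns there are m(m−1)/2 distinct pairs, so a naive pairwise loop costs roughly O(m² · n log n) for Spearman. A smarter implementation ranks each column once, dropping the cost to O(m · n log n + m² · n). pandas exposes the full matrix via `DataFrame.corr`; the random data below is made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))

# m = 4 columns gives m*(m-1)/2 = 6 distinct pairs; each pair pays the
# ranking cost unless ranks are computed once per column and reused.
corr_matrix = df.corr(method="spearman")
print(corr_matrix.shape)
```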