Correlation with corr() in Python Data Analysis - Time & Space Complexity
We want to understand how the time to calculate correlation grows as the data size increases.
Specifically, how does the corr() function behave when working with larger datasets?
Analyze the time complexity of the following code snippet.
import pandas as pd

n = 1000  # example size
data = pd.DataFrame({
    'A': range(n),        # 0, 1, ..., n-1 (strictly increasing)
    'B': range(n, 0, -1)  # n, n-1, ..., 1 (strictly decreasing)
})
result = data.corr()
This code creates a DataFrame with two columns of length n and calculates their correlation matrix.
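As a quick sanity check on what the snippet produces (nothing beyond the code above is assumed): column 'B' decreases by exactly one step each time 'A' increases, so their Pearson correlation is -1 and the result is a 2 x 2 matrix:

```python
import pandas as pd

n = 1000
data = pd.DataFrame({
    'A': range(n),        # 0, 1, ..., n-1 (strictly increasing)
    'B': range(n, 0, -1)  # n, n-1, ..., 1 (strictly decreasing)
})
result = data.corr()

# The matrix is m x m (here 2 x 2); the off-diagonal entry is -1
# because 'B' is a perfect linear function of 'A'.
print(result.shape)
print(round(result.loc['A', 'B'], 10))  # -1.0
```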
Identify the loops, recursion, or array traversals that repeat:
- Primary operation: Pairwise correlation calculation between columns.
- How many times: For each pair of columns, it processes all n rows once.
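A hand-rolled sketch (not pandas' actual implementation, which is vectorized in C) makes the repeated work visible: for every pair of columns there is one full pass over all n rows. The `corr_matrix` helper below is purely illustrative:

```python
import math

def corr_matrix(columns):
    """Pearson correlation for every pair of columns (illustrative only)."""
    names = list(columns)
    n = len(next(iter(columns.values())))
    result = {}
    for a in names:            # m choices ...
        for b in names:        # ... times m choices = m^2 pairs
            xs, ys = columns[a], columns[b]
            # Each aggregate below is one O(n) pass over the rows.
            mean_x = sum(xs) / n
            mean_y = sum(ys) / n
            cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            var_x = sum((x - mean_x) ** 2 for x in xs)
            var_y = sum((y - mean_y) ** 2 for y in ys)
            result[(a, b)] = cov / math.sqrt(var_x * var_y)
    return result

cols = {'A': [0, 1, 2, 3, 4], 'B': [5, 4, 3, 2, 1]}
r = corr_matrix(cols)
# m^2 = 4 entries, each costing O(n) work -> O(n * m^2) overall.
```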
As the number of rows n grows, the time to compute correlation grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations per column pair |
| 100 | About 100 operations per column pair |
| 1000 | About 1000 operations per column pair |
Pattern observation: Doubling the number of rows roughly doubles the work needed.
Time Complexity: O(n * m^2), where n is the number of rows and m is the number of columns.
This means the time grows linearly with the number of rows and quadratically with the number of columns.
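One way to see the quadratic factor directly: corr() returns an m x m matrix, so the number of correlation entries alone grows quadratically with the column count. A small sketch (the random data and column names here are made up for illustration):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 100  # fixed row count, so only m varies

for m in (2, 4, 8):
    df = pd.DataFrame(rng.random((n, m)),
                      columns=[f'col{i}' for i in range(m)])
    matrix = df.corr()
    # Doubling m quadruples the number of correlation entries computed.
    print(m, matrix.shape, matrix.size)
```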
[X] Wrong: "Correlation calculation time depends only on the number of rows."
[OK] Correct: Because correlation is calculated for every pair of columns, the number of columns also affects the time, especially when there are many columns.
Understanding how correlation scales helps you explain data processing costs clearly and shows you can think about efficiency in real data tasks.
"What if we calculate correlation only between two columns instead of all pairs? How would the time complexity change?"