Correlation Heatmaps in Python Data Analysis - Time & Space Complexity
We want to understand how the time to create a heatmap for correlation grows as the data size increases.
Specifically, how does the number of operations change when we calculate and display correlations for more variables?
Analyze the time complexity of the following code snippet.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 10 variables, each with 1000 data points
data = pd.DataFrame({f'var{i}': range(1000) for i in range(10)})
corr = data.corr()   # pairwise correlation matrix (10 x 10)
sns.heatmap(corr)    # draw the matrix as a color-coded grid
plt.show()
This code creates a correlation heatmap for 10 variables, each with 1000 data points.
Identify the repeated operations: loops, recursion, or array traversals.
- Primary operation: Calculating pairwise correlations between variables.
- How many times: The correlation matrix has one entry per pair of variables, so roughly n × n computations for n variables. (The matrix is symmetric, so only about half the pairs are unique, but that constant factor does not change the growth rate.)
As the number of variables increases, the number of correlation calculations grows quadratically. Each individual correlation also scans all m data points, so the total work is on the order of n² × m.
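To make the n × n pattern visible, here is a minimal sketch of what a pairwise correlation amounts to, written as explicit nested loops. The helper `corr_matrix` and the operation counter are illustrative, not how pandas implements `DataFrame.corr` internally:

```python
import numpy as np

def corr_matrix(data):
    """Naive pairwise correlation: two nested loops over the
    n columns give n * n correlation computations."""
    n = data.shape[1]
    out = np.empty((n, n))
    ops = 0
    for i in range(n):          # n iterations
        for j in range(n):      # n iterations each -> n * n total
            out[i, j] = np.corrcoef(data[:, i], data[:, j])[0, 1]
            ops += 1
    return out, ops

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))   # 1000 rows, 10 variables
matrix, ops = corr_matrix(data)
print(ops)  # 10 * 10 = 100 pairwise computations
```

The two nested loops over the same range of n columns are exactly the structure that produces O(n²) growth.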
| Input Size (n variables) | Approx. Operations (correlations) |
|---|---|
| 10 | 100 |
| 100 | 10,000 |
| 1000 | 1,000,000 |
Pattern observation: The operations grow roughly with the square of the number of variables.
Time Complexity: O(n²)
This means if you double the number of variables, the work to compute correlations roughly quadruples.
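A quick arithmetic check of the doubling claim; `correlation_ops` is just an illustrative counting helper, not a pandas function:

```python
def correlation_ops(n_vars):
    """Number of entries in an n x n correlation matrix."""
    return n_vars * n_vars

for n in (10, 100, 1000):
    print(f"{n} variables -> {correlation_ops(n):,} correlations")

# Doubling the number of variables quadruples the work:
ratio = correlation_ops(20) / correlation_ops(10)
print(ratio)  # 4.0
```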
[X] Wrong: "Calculating correlations grows linearly with the number of variables."
[OK] Correct: Each variable pairs with every other variable, so the number of pairs grows quadratically, not linearly, with the number of variables.
Understanding how correlation heatmaps scale helps you explain performance when working with many variables in real data projects.
"What if we only calculate correlations for a subset of variable pairs? How would the time complexity change?"
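One way to explore that question: if you compute correlations only for k chosen pairs instead of the full matrix, the work drops from O(n² × m) to O(k × m). The sketch below assumes a hypothetical helper `subset_correlations` built on `Series.corr`:

```python
import pandas as pd

def subset_correlations(df, pairs):
    """Compute correlations only for the requested column pairs.
    Cost is O(k * m) for k pairs of length-m columns,
    instead of O(n^2 * m) for the full matrix."""
    return {(a, b): df[a].corr(df[b]) for a, b in pairs}

df = pd.DataFrame({f'var{i}': range(1000) for i in range(10)})
pairs = [('var0', 'var1'), ('var2', 'var5'), ('var3', 'var9')]
result = subset_correlations(df, pairs)
print(len(result))  # 3 correlations instead of the full 100
```

If the number of pairs k is fixed, the variable count n drops out entirely and the time complexity becomes linear in the number of data points.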