How to Use corr in pandas: Calculate Correlation Easily
Use the corr() method on a pandas DataFrame to calculate the pairwise correlation matrix of its numeric columns. It returns a new DataFrame of correlation coefficients, which measure how strongly each pair of columns is related.
Syntax
The basic syntax of corr() in pandas is:
```python
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)
```
method: The correlation method to use. Default is 'pearson'. Other options include 'kendall' and 'spearman'.
min_periods: Minimum number of observations required per pair of columns to have a valid result.
numeric_only: Whether to include only float, int, or boolean columns. Available since pandas 1.5; the default is False from pandas 2.0 onward.
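As a rough illustration of how the method argument changes the result, the sketch below compares Pearson and Spearman on a small made-up dataset in which y is the cube of x. The column names and values are invented for this example only.

```python
import pandas as pd

# Made-up data: y always increases with x, but not along a straight line
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson_r = df.corr(method="pearson").loc["x", "y"]
spearman_r = df.corr(method="spearman").loc["x", "y"]

# Spearman compares ranks, so a perfectly monotonic relationship scores 1.0.
# Pearson measures straight-line fit, so it comes out below 1.0 here.
print(f"pearson:  {pearson_r:.3f}")
print(f"spearman: {spearman_r:.3f}")
```

A gap between the two values like this one is often a hint that the relationship is monotonic but not linear.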
Example
This example calculates the correlation matrix of a DataFrame with three numeric columns using corr(). Each cell of the result shows how strongly one pair of columns moves together.
```python
import pandas as pd

data = {
    'age': [25, 32, 47, 51, 62],
    'income': [50000, 60000, 80000, 90000, 120000],
    'score': [200, 220, 250, 270, 300],
}
df = pd.DataFrame(data)

# Pearson correlation between every pair of columns
correlation_matrix = df.corr()

# Round to three decimals for readability
print(correlation_matrix.round(3))
```
Output
```
          age  income  score
age     1.000   0.982  0.994
income  0.982   1.000  0.991
score   0.994   0.991  1.000
```
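Once you have a matrix like the one above, you usually want specific values out of it. The following is a minimal sketch, reusing the same example data, of looking up one coefficient by label and ranking the other columns by their correlation with age. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [50000, 60000, 80000, 90000, 120000],
    "score": [200, 220, 250, 270, 300],
})
corr_matrix = df.corr()

# A single coefficient: look it up by row label and column label.
# The matrix is symmetric, so ("income", "age") gives the same value.
age_income = corr_matrix.loc["age", "income"]
print(round(age_income, 3))

# Correlation of every other column with 'age', strongest first
age_corrs = corr_matrix["age"].drop("age").sort_values(ascending=False)
print(age_corrs.round(3))
```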
Common Pitfalls
Common mistakes when using corr() include:
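- Passing a DataFrame that contains non-numeric columns. Since pandas 2.0, corr() can raise a ValueError on them (for example on string columns) unless you pass numeric_only=True; earlier versions dropped them automatically.
- Overlooking missing values. corr() excludes them pair by pair, so different coefficients in the same matrix can be based on different subsets of rows.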
- Assuming correlation implies causation; correlation only shows association strength.
```python
import pandas as pd

data = {
    'age': [25, 32, None, 51, 62],
    'income': [50000, 60000, 80000, None, 120000],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
}
df = pd.DataFrame(data)

# numeric_only=True skips the non-numeric 'name' column
# (pandas 2.0+ raises a ValueError without it)
# Missing values are excluded pairwise, so only the three rows
# where both age and income are present contribute
correlation = df.corr(numeric_only=True)
print(correlation.round(3))
```
Output
```
          age  income
age     1.000   0.999
income  0.999   1.000
```
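To make the missing-value pitfall more concrete, here is a small sketch on made-up data that compares the default pairwise behaviour with dropping incomplete rows first via dropna(). The columns a, b, and c and their values are assumptions chosen so that the two approaches disagree visibly.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in different rows of 'b' and 'c'
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 9.0, np.nan, 8.0, 10.0, 12.0],
    "c": [5.0, np.nan, 3.5, 3.0, 1.0, 0.5],
})

# How many rows each pair of columns actually shares
present = df.notna().astype(int)
pair_counts = present.T @ present
print(pair_counts)

# Default: each pair uses every row where both of its values exist
pairwise = df.corr()

# Alternative: drop incomplete rows first so all pairs share the same rows
listwise = df.dropna().corr()

print(pairwise.round(3))
print(listwise.round(3))
```

Here the a-b coefficient changes noticeably between the two matrices, because the row where b departs from the trend is kept by the pairwise calculation but removed by dropna() (c is missing in that row). Neither approach is always right; the point is to know which rows each number is based on.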
Quick Reference
Summary tips for using corr():
- Use method='pearson' for linear correlation (default).
- Use method='kendall' or method='spearman' for rank-based correlation.
- Missing values are ignored pairwise by default.
- Pass numeric_only=True to restrict the calculation to numeric columns; pandas 2.0+ no longer drops other columns automatically.
| Parameter | Description | Default |
|---|---|---|
| method | Correlation method: 'pearson', 'kendall', or 'spearman' | 'pearson' |
| min_periods | Minimum observations required per pair | 1 |
| numeric_only | Include only float, int, or boolean columns | False (pandas 2.0+) |
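The min_periods parameter is easiest to understand on sparse data. The sketch below uses an invented DataFrame in which column z has only two non-missing values, and shows how raising min_periods replaces coefficients built on too few observations with NaN.

```python
import numpy as np
import pandas as pd

# Made-up data: 'z' has only two non-missing values
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
    "z": [1.0, np.nan, np.nan, np.nan, 3.0],
})

# With the default min_periods=1, x-z is computed from just two rows,
# and two rising points always look perfectly correlated
loose = df.corr()
print(loose.round(3))

# Requiring at least 3 shared observations turns the x-z and y-z cells into NaN
strict = df.corr(min_periods=3)
print(strict.round(3))
```

A threshold of 3 is arbitrary here; a sensible value depends on how much data you have and how much noise you can tolerate.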
Key Takeaways
Use DataFrame.corr() to get the correlation matrix of a DataFrame's numeric columns.
Default method is 'pearson' for linear correlation; others include 'kendall' and 'spearman'.
Non-numeric columns are skipped only when you pass numeric_only=True; since pandas 2.0 they can otherwise cause an error.
Missing values are handled pairwise and do not cause errors.
Correlation shows association strength, not cause-effect relationships.