
How to Use Correlation Matrix for Features in Python

Use pandas.DataFrame.corr() to compute the correlation matrix of features in Python. This matrix shows how strongly each feature relates to others, helping you identify redundant or important features for machine learning.

📐

Syntax

The correlation matrix is computed using DataFrame.corr() in pandas. It returns a table showing correlation coefficients between pairs of features.

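df.corr(): Computes the pairwise correlation of columns, using the Pearson coefficient by default.
Returns a square DataFrame whose rows and columns are both the feature names.
Values range from -1 (perfect negative) to 1 (perfect positive); values near 0 indicate no linear relationship.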

python

correlation_matrix = df.corr()
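The method parameter of DataFrame.corr() selects the coefficient: 'pearson' (the default), 'spearman', or 'kendall'. The following is a minimal sketch on a made-up two-column DataFrame (the column names and values are illustrative assumptions, not part of the example dataset used later). It shows how the rank-based Spearman coefficient differs from Pearson on a monotonic but non-linear relationship.

```python
import pandas as pd

# Hypothetical toy data: y increases with x, but not linearly (y = x ** 3)
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

# Default: Pearson, which measures linear association
pearson = df.corr()

# Spearman works on ranks, so any strictly increasing relationship scores 1
spearman = df.corr(method="spearman")

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Pearson stays below 1 here because the points do not lie on a straight line, while Spearman reports a perfect monotonic relationship.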

💻

Example

This example loads the iris dataset into a DataFrame, computes the correlation matrix, and prints it. In the output, petal length and petal width have a coefficient of about 0.96, which marks them as largely redundant features.

python

import pandas as pd
from sklearn.datasets import load_iris

# Load iris dataset as a pandas DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Compute correlation matrix
correlation_matrix = df.corr()

# Print correlation matrix
print(correlation_matrix)

Output

                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
sepal length (cm)           1.000000         -0.117570           0.871754          0.817941
sepal width (cm)           -0.117570          1.000000          -0.428440         -0.366126
petal length (cm)           0.871754         -0.428440           1.000000          0.962865
petal width (cm)            0.817941         -0.366126           0.962865          1.000000
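Reading a matrix by eye gets harder as the number of features grows. One possible way to list only the strongly related pairs is to keep the upper triangle of the matrix and filter by absolute value. This is a sketch under assumptions: the toy DataFrame, its column names, and the 0.8 threshold are all illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is nearly a multiple of "a"; "c" is unrelated
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

corr = df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal),
# so each pair appears once and self-correlations are ignored
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Illustrative threshold; choose one that suits your data
threshold = 0.8
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```

The result is a Series indexed by (feature, feature) pairs, which is convenient to sort or iterate over.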

⚠️

Common Pitfalls

Common mistakes when using correlation matrices include:

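Confusing correlation with causation: correlation only shows that two features move together, not that one causes the other.
Ignoring the measurement scale of features: Pearson correlation assumes continuous numeric data and only captures linear relationships.
Using correlation on categorical features without encoding: depending on the pandas version, string columns either raise an error or are silently left out of the result.
Not handling missing values before computing correlation: pandas skips missing values pair by pair, so different coefficients may be computed from different subsets of rows.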

python

import pandas as pd

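# Wrong: correlation on string data without encoding.
# pandas 2.x raises here; older versions silently drop the column instead.
cat_df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
try:
    print(cat_df.corr())
except Exception as e:
    print(f'Error: {e}')

# Right: one-hot encode categorical data before correlation
# (dtype=float keeps the dummy columns numeric on every pandas version)
cat_df_encoded = pd.get_dummies(cat_df, dtype=float)
print(cat_df_encoded.corr())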

Output

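Error: could not convert string to float: 'red'
             color_blue  color_green  color_red
color_blue     1.000000    -0.577350  -0.577350
color_green   -0.577350     1.000000  -0.333333
color_red     -0.577350    -0.333333   1.000000

The error message shown is from pandas 2.x. The dummy columns correlate negatively with each other because each row sets exactly one of them to 1.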
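The missing-value pitfall listed above can be seen in a small sketch. DataFrame.corr() does not raise on NaN; it drops incomplete pairs, and its min_periods parameter sets how many complete pairs are required before a coefficient is reported. The toy DataFrame and its values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in "y"
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, np.nan, 10.0],
})

# corr() skips rows where either value of a pair is missing,
# so this coefficient is based on only 3 of the 5 rows
default_r = df.corr().loc["x", "y"]
print(default_r)

# Require at least 4 complete pairs; pairs with fewer come back as NaN
strict_r = df.corr(min_periods=4).loc["x", "y"]
print(strict_r)

# Alternative: drop incomplete rows first so every pair uses the same rows
clean = df.dropna()
print(len(clean))
```

Whether to drop rows, impute values, or set min_periods depends on how much data is missing and why.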

📊

Quick Reference

Use: df.corr() to get correlation matrix.
Interpret: Values near 1 or -1 show strong positive or negative correlation.
Use for: Feature selection, detecting multicollinearity.
Handle: Encode categorical features and clean missing data before correlation.
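For the feature-selection use case, one common heuristic is to drop one feature from each highly correlated pair. The sketch below assumes a hypothetical helper named drop_correlated, a made-up DataFrame, and an arbitrary 0.9 threshold. It always drops the later column of a pair, which is a simplification; domain knowledge may suggest keeping a different one.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Hypothetical helper: drop the later column of each pair with |r| > threshold."""
    corr = df.corr().abs()
    # Upper triangle only, so each pair is checked once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy data: the two height columns measure the same quantity
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41.0, 23.0, 35.0, 52.0, 29.0],
})

reduced, dropped = drop_correlated(df)
print(dropped)
print(list(reduced.columns))
```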

✅

Key Takeaways

Use pandas DataFrame.corr() to compute a feature correlation matrix.

Correlation values range from -1 to 1 indicating strength and direction of relationships.

Always encode categorical features before computing correlation to avoid errors.

A correlation matrix helps identify redundant features, which can simplify the model and improve its performance.

Handle missing data before correlation to ensure accurate results.