
How to Remove Highly Correlated Features in Python with sklearn

To remove highly correlated features in Python, calculate the correlation matrix using pandas.DataFrame.corr(), then drop features with correlation above a chosen threshold. You can automate this by iterating over the matrix and removing one feature from each highly correlated pair before training your model.
📐

Syntax

Use pandas.DataFrame.corr() to get the correlation matrix of your features. Then, identify pairs with correlation above a threshold (e.g., 0.9). Finally, drop one feature from each pair to reduce redundancy.

Key parts:

  • df.corr(): computes the pairwise correlation matrix (Pearson by default).
  • Threshold: a float value to decide when features are 'highly correlated'.
  • Dropping features: remove columns from your DataFrame.
python
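import numpy as np
import pandas as pd

def remove_highly_correlated_features(df, threshold=0.9):
    # Absolute values so strong negative correlations are caught too
    corr_matrix = df.corr().abs()
    # Keep only the cells above the diagonal so each pair is checked once
    upper = corr_matrix.where(
        pd.DataFrame(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool), index=corr_matrix.index, columns=corr_matrix.columns))
    # Drop any column that is highly correlated with a column to its left
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return df.drop(columns=to_drop)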
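
As a quick check of the function above, here is a minimal sketch on a made-up DataFrame. The column names `a`, `b`, `c` and their values are hypothetical, and the function is repeated here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def remove_highly_correlated_features(df, threshold=0.9):
    corr_matrix = df.corr().abs()
    # Boolean mask that is True strictly above the diagonal
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Hypothetical toy data: 'b' is an exact multiple of 'a', 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced = remove_highly_correlated_features(toy, threshold=0.9)
print(list(reduced.columns))  # ['a', 'c']
```

Because the check runs over the upper triangle, the later column of a pair (`b` here) is the one removed, so column order influences which feature survives.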
💻

Example

This example shows how to remove features with correlation above 0.8 from a sample dataset using pandas and sklearn.

python
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes

# Load sample data (load_boston was removed in scikit-learn 1.2,
# so this example uses the built-in diabetes dataset instead)
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Calculate absolute correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(
    pd.DataFrame(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool), index=corr_matrix.index, columns=corr_matrix.columns))

# Find features with correlation greater than 0.8
threshold = 0.8
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

print('Features to drop:', to_drop)

# Drop features
df_reduced = df.drop(columns=to_drop)

print('Original shape:', df.shape)
print('Reduced shape:', df_reduced.shape)
Output
Features to drop: ['s2']
Original shape: (442, 10)
Reduced shape: (442, 9)
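
To see which pairs exceed the threshold, rather than only the columns that get dropped, the upper triangle can be flattened into a list of pairs. The following is a minimal sketch on synthetic data; the column names, noise levels, and random seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),   # nearly a copy of x
    "x_neg": -x + rng.normal(scale=0.1, size=200),    # strongly negative
    "noise": rng.normal(size=200),                    # unrelated
})

corr_matrix = demo.corr().abs()
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# stack() yields one entry per (row, column) cell; the comparison below
# discards masked (NaN) cells, leaving one entry per unordered pair
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8].sort_values(ascending=False)
print(high_pairs)
```

Inspecting the pairs first makes it easier to judge whether the chosen threshold is removing what you expect.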
⚠️

Common Pitfalls

1. Choosing a threshold that is too low or too high: a low threshold may remove useful features; a high one may keep redundant ones.

2. Not using absolute correlation: Correlation can be negative; always use absolute values to catch strong negative correlations.

3. Dropping features without domain knowledge: Sometimes correlated features carry unique information; consider feature importance before dropping.

python
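import pandas as pd
import numpy as np

# Tiny illustrative frame: 'b' is a perfect negative copy of 'a'
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [-1, -2, -3, -4], 'c': [2, 1, 4, 3]})
threshold = 0.9

# Wrong: using raw correlation without absolute value
corr_matrix = df.corr()
upper = corr_matrix.where(
    pd.DataFrame(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool), index=corr_matrix.index, columns=corr_matrix.columns))
# corr(a, b) is -1.0, so the "> threshold" check misses this pair
wrong_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('Without abs():', wrong_drop)  # []

# Right: use absolute correlation
corr_matrix_abs = df.corr().abs()
upper_abs = corr_matrix_abs.where(
    pd.DataFrame(np.triu(np.ones(corr_matrix_abs.shape), k=1).astype(bool), index=corr_matrix_abs.index, columns=corr_matrix_abs.columns))
right_drop = [column for column in upper_abs.columns if any(upper_abs[column] > threshold)]
print('With abs():', right_drop)  # ['b']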
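
For pitfall 3, one possible approach is to let the target decide which member of a correlated pair to keep: drop whichever feature has the weaker absolute correlation with the target. This is a heuristic sketch and not a substitute for domain knowledge; the helper name `drop_weaker_of_pairs` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_weaker_of_pairs(X, y, threshold=0.9):
    """Hypothetical helper: for each highly correlated pair, drop the feature
    that is less correlated (in absolute value) with the target y."""
    corr_matrix = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    cols = list(X.columns)
    to_drop = set()
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            if first in to_drop or second in to_drop:
                continue  # one member of this pair is already marked
            if corr_matrix.loc[first, second] > threshold:
                weaker = first if target_corr[first] < target_corr[second] else second
                to_drop.add(weaker)
    return X.drop(columns=sorted(to_drop))

# Synthetic data: 'noisy_copy' duplicates 'clean' with extra noise
rng = np.random.default_rng(42)
signal = rng.normal(size=300)
X = pd.DataFrame({
    "clean": signal,
    "noisy_copy": signal + rng.normal(scale=0.3, size=300),
    "other": rng.normal(size=300),
})
y = pd.Series(2.0 * signal + rng.normal(scale=0.2, size=300), name="target")

X_reduced = drop_weaker_of_pairs(X, y, threshold=0.9)
print(list(X_reduced.columns))
```

Unlike the position-based rule, this keeps `clean` regardless of column order, at the cost of using the target during feature selection, so it should be fitted on training data only.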
📊

Quick Reference

  • Calculate correlation matrix with df.corr().abs().
  • Use upper triangle to avoid duplicate pairs.
  • Set a threshold (e.g., 0.8 or 0.9) to identify highly correlated features.
  • Drop one feature from each correlated pair.
  • Check feature importance before dropping.
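
Because the columns to drop should be chosen on the training data and then applied unchanged to new data, the steps above can be wrapped in a scikit-learn style transformer. The class name `CorrelationFilter` is hypothetical (it is not part of scikit-learn), and this sketch assumes the input is a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drops one column from each highly correlated pair."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr_matrix = X.corr().abs()
        mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
        upper = corr_matrix.where(mask)
        # Remember the columns selected on the training data
        self.to_drop_ = [c for c in upper.columns if (upper[c] > self.threshold).any()]
        return self

    def transform(self, X):
        # Apply the same column selection to any new DataFrame
        return X.drop(columns=self.to_drop_)

# Synthetic training data: 'f2' is a linear function of 'f1'
rng = np.random.default_rng(1)
base = rng.normal(size=100)
train = pd.DataFrame({"f1": base, "f2": base * 3 + 1, "f3": rng.normal(size=100)})
test = pd.DataFrame({"f1": [0.1, 0.2], "f2": [1.3, 1.6], "f3": [0.5, -0.5]})

filt = CorrelationFilter(threshold=0.9).fit(train)
print(filt.to_drop_)                        # ['f2']
print(list(filt.transform(test).columns))   # ['f1', 'f3']
```

Fitting once and reusing `to_drop_` keeps the training and test sets aligned on the same columns.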

Key Takeaways

Calculate the absolute correlation matrix to find highly correlated features.
Use the upper triangle of the matrix to avoid duplicate checks.
Set a sensible threshold like 0.8 or 0.9 to decide which features to drop.
Always consider domain knowledge or feature importance before removing features.
Dropping highly correlated features reduces redundancy and can make models, especially linear ones, more stable and easier to interpret.