How to Remove Highly Correlated Features in Python with sklearn
To remove highly correlated features in Python, calculate the correlation matrix using
pandas.DataFrame.corr(), then drop features with correlation above a chosen threshold. You can automate this by iterating over the matrix and removing one feature from each highly correlated pair before training your model.
Syntax
Use pandas.DataFrame.corr() to get the correlation matrix of your features. Then, identify pairs with correlation above a threshold (e.g., 0.9). Finally, drop one feature from each pair to reduce redundancy.
Key parts:
- df.corr(): computes the correlation matrix.
- Threshold: a float value to decide when features are 'highly correlated'.
- Dropping features: remove columns from your DataFrame.
```python
import numpy as np
import pandas as pd

def remove_highly_correlated_features(df, threshold=0.9):
    # Use absolute values so strong negative correlations are caught too
    corr_matrix = df.corr().abs()
    # Keep only the upper triangle so each pair is checked exactly once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
    # Drop any column that is highly correlated with an earlier column
    to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]
    return df.drop(columns=to_drop)
```
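To see why the upper triangle matters, here is a minimal sketch on a tiny DataFrame (the column names 'a', 'b', 'c' are purely illustrative). Masking the lower triangle and the diagonal leaves each feature pair exactly once, so no pair is counted twice and a feature is never compared with itself:

```python
import numpy as np
import pandas as pd

# Tiny illustrative DataFrame: 'b' is an exact linear function of 'a'
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, 6.0, 8.0],
                   'c': [4.0, 1.0, 3.0, 2.0]})

corr = df.corr().abs()
# Mask the lower triangle and diagonal; each pair now appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper)

# Column 'b' has an entry above 0.9 (corr(a, b) == 1.0), so 'b' is dropped
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']
```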
Example
This example shows how to remove features with absolute correlation above 0.8. The Boston housing dataset (load_boston) used in many older tutorials was removed from scikit-learn in version 1.2, so the sample data here is built with pandas and NumPy instead: two columns are deliberately made near-duplicates of a base feature.
```python
import numpy as np
import pandas as pd

# Build a sample dataset with known correlation structure
rng = np.random.default_rng(42)
n = 100
base = rng.normal(size=n)
df = pd.DataFrame({
    'A': base,
    'B': base * 2 + rng.normal(scale=0.01, size=n),   # near-duplicate of A
    'C': rng.normal(size=n),                          # independent feature
    'D': -base + rng.normal(scale=0.01, size=n),      # strong negative corr with A
})

# Calculate the absolute correlation matrix
corr_matrix = df.corr().abs()

# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Find features with absolute correlation greater than 0.8
threshold = 0.8
to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]
print('Features to drop:', to_drop)

# Drop the redundant features
df_reduced = df.drop(columns=to_drop)
print('Original shape:', df.shape)
print('Reduced shape:', df_reduced.shape)
```
Output
Features to drop: ['B', 'D']
Original shape: (100, 4)
Reduced shape: (100, 2)
Common Pitfalls
1. Choosing a threshold that is too low or too high: a threshold that is too low may remove useful features; one that is too high may keep redundant ones.
2. Not using absolute correlation: Correlation can be negative; always use absolute values to catch strong negative correlations.
3. Dropping features without domain knowledge: Sometimes correlated features carry unique information; consider feature importance before dropping.
```python
import numpy as np
import pandas as pd

# df is assumed to be your feature DataFrame

# Wrong: raw correlation misses strong negative correlations
corr_matrix = df.corr()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Right: absolute correlation catches both positive and negative relationships
corr_matrix_abs = df.corr().abs()
upper_abs = corr_matrix_abs.where(np.triu(np.ones(corr_matrix_abs.shape, dtype=bool), k=1))
```
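For pitfall 3, one way to decide which member of a correlated pair to keep is to compare model-based feature importances. A sketch using RandomForestRegressor on synthetic data (the feature names f1, f2, f3 and the data-generating process are illustrative, not from the original example):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({'f1': rng.normal(size=200),
                  'f2': rng.normal(size=200)})
X['f3'] = X['f1'] * 0.9 + rng.normal(scale=0.1, size=200)  # correlated with f1
y = X['f1'] + 2 * X['f2'] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# Of the correlated pair (f1, f3), prefer dropping the one with lower importance
```

Note that importances get diluted across correlated features, so interpret them for the pair jointly rather than in isolation.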
Quick Reference
- Calculate the correlation matrix with df.corr().abs().
- Use the upper triangle to avoid duplicate pairs.
- Set a threshold (e.g., 0.8 or 0.9) to identify highly correlated features.
- Drop one feature from each correlated pair.
- Check feature importance before dropping.
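scikit-learn has no built-in correlation filter, but you can wrap the recipe above as a custom transformer so it slots into a Pipeline and learns which columns to drop from the training data only. A sketch (the class name CorrelationFilter and the sample column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Drops one feature from each highly correlated pair, learned at fit time."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr = pd.DataFrame(X).corr().abs()
        # Upper triangle so each pair is checked once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        self.to_drop_ = [c for c in upper.columns if (upper[c] > self.threshold).any()]
        return self

    def transform(self, X):
        return pd.DataFrame(X).drop(columns=self.to_drop_)

# Usage on a small DataFrame ('b' is nearly twice 'a')
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, 6.1, 8.0],
                   'c': [4.0, 1.0, 3.0, 2.0]})
filt = CorrelationFilter(threshold=0.9).fit(df)
print(filt.to_drop_)  # ['b']
```

Learning the columns in fit and reusing them in transform keeps train and test sets aligned, which a stateless function applied separately to each split would not guarantee.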
Key Takeaways
Calculate the absolute correlation matrix to find highly correlated features.
Use the upper triangle of the matrix to avoid duplicate checks.
Set a sensible threshold like 0.8 or 0.9 to decide which features to drop.
Always consider domain knowledge or feature importance before removing features.
Dropping highly correlated features reduces redundancy and can improve model stability and interpretability.