How to Handle Multicollinearity in Python with sklearn
To handle multicollinearity in Python with sklearn, first detect it using the Variance Inflation Factor (VIF), then remove or combine the correlated features, or use a regularization method such as Ridge regression.

Why This Happens
Multicollinearity occurs when two or more features in your dataset are strongly related. This makes it hard for models to decide which feature is important, leading to unstable or misleading results.
Here is an example where multicollinearity causes issues in a linear regression model:
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create data with multicollinearity
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05  # Highly correlated with x1
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1

data = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

model = LinearRegression()
model.fit(data[['x1', 'x2']], data['y'])
print(f'Coefficients: x1={model.coef_[0]:.3f}, x2={model.coef_[1]:.3f}')
```
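Before fitting anything, you can confirm the problem directly: a quick sketch, using the same synthetic data, that checks the Pearson correlation between the two features (values near 1.0 signal multicollinearity).

```python
import numpy as np

# Same synthetic data as above
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05

# Pearson correlation between x1 and x2; values near 1.0 indicate multicollinearity
corr = np.corrcoef(x1, x2)[0, 1]
print(f'Correlation between x1 and x2: {corr:.3f}')
```

With x2 constructed almost entirely from x1, the correlation is close to 1, which is why the regression above cannot reliably split the effect between the two coefficients.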
The Fix
To fix multicollinearity, first detect it using the Variance Inflation Factor (VIF). Then remove or combine features with high VIF values. Alternatively, use regularization such as Ridge regression, which reduces coefficient instability.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Data same as before
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1
data = pd.DataFrame({'x1': x1, 'x2': x2})

# Calculate VIF (add a constant column so VIF is measured against
# an intercept model; without it the values are misleading)
X = add_constant(data)
vif_data = pd.DataFrame()
vif_data['feature'] = data.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i + 1)
                   for i in range(data.shape[1])]
print(vif_data)

# Remove x2 due to high VIF
X_reduced = data[['x1']]

# Fit Ridge regression on the reduced feature set
model = Ridge(alpha=1.0)
model.fit(X_reduced, y)
print(f'Ridge coefficient for x1: {model.coef_[0]:.3f}')
```
Prevention
To avoid multicollinearity problems, always check feature correlations before modeling. Use VIF to detect it early. Consider removing or combining correlated features, or using models with regularization like Ridge or Lasso. Also, scale your features so the regularization penalty treats them equally.
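Scaling and regularization can be combined in one step. Here is a minimal sketch, assuming the same synthetic data as earlier, that chains StandardScaler and Ridge with sklearn's make_pipeline so the features are standardized before the penalty is applied.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Same synthetic data as earlier
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1

# Standardize features first so the Ridge penalty shrinks them on equal terms
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print('Ridge coefficients (scaled features):', model.named_steps['ridge'].coef_)
```

Using a pipeline also prevents a common leak: the scaler is fit only on the training data passed to `fit`, so the same transformation is reused consistently at prediction time.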
Related Errors
Other issues related to multicollinearity include:
- Overfitting: Model fits noise due to redundant features.
- Unstable coefficients: Small data changes cause big coefficient swings.
- High standard errors: Coefficients become statistically insignificant.
Quick fixes include feature selection, dimensionality reduction (like PCA), or regularization.
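Of those quick fixes, PCA is worth a brief illustration: it replaces the correlated features with uncorrelated components. A minimal sketch, reusing the same synthetic data, that projects the two correlated features onto a single principal component before regressing:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Same synthetic data as earlier
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1

# Standardize, then project onto one principal component;
# the components are uncorrelated by construction
pca_model = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pca_model.fit(X, y)
print('Explained variance ratio:',
      pca_model.named_steps['pca'].explained_variance_ratio_)
```

Because x1 and x2 are nearly identical, the first component captures almost all of the variance, so little information is lost by dropping the second. The trade-off is interpretability: coefficients now describe components, not the original features.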