How to Handle Multicollinearity in Python with sklearn
To handle multicollinearity in Python with sklearn, first detect it using the Variance Inflation Factor (VIF), then remove or combine the correlated features, or use a regularization method such as Ridge regression.

Why This Happens
Multicollinearity occurs when two or more features in your dataset are strongly related. This makes it hard for models to decide which feature is important, leading to unstable or misleading results.
Here is an example where multicollinearity causes issues in a linear regression model:
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create data with multicollinearity
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05  # Highly correlated with x1
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1

data = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

model = LinearRegression()
model.fit(data[['x1', 'x2']], data['y'])
print(f'Coefficients: x1={model.coef_[0]:.3f}, x2={model.coef_[1]:.3f}')
```
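Before fitting anything, you can confirm the problem directly: a quick sketch, using the same synthetic data, that checks the Pearson correlation between the two features (values near 1.0 signal multicollinearity).

```python
import numpy as np

# Same synthetic data as above
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05

# Pearson correlation between x1 and x2; values near 1.0 indicate multicollinearity
corr = np.corrcoef(x1, x2)[0, 1]
print(f'Correlation between x1 and x2: {corr:.3f}')
```

With x2 constructed almost entirely from x1, the correlation is close to 1, which is why the regression above cannot reliably split the effect between the two coefficients.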
The Fix
To fix multicollinearity, first detect it using the Variance Inflation Factor (VIF). Then remove or combine features with high VIF values. Alternatively, use regularization such as Ridge regression, which reduces coefficient instability.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Data same as before
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1
data = pd.DataFrame({'x1': x1, 'x2': x2})

# Calculate VIF (add a constant column so VIF is measured against
# an intercept model; without it the values are misleading)
X = add_constant(data)
vif_data = pd.DataFrame()
vif_data['feature'] = data.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i + 1)
                   for i in range(data.shape[1])]
print(vif_data)

# Remove x2 due to high VIF
X_reduced = data[['x1']]

# Fit Ridge regression on the reduced feature set
model = Ridge(alpha=1.0)
model.fit(X_reduced, y)
print(f'Ridge coefficient for x1: {model.coef_[0]:.3f}')
```
Prevention
To avoid multicollinearity problems, always check feature correlations before modeling. Use VIF to detect it early. Consider removing or combining correlated features, or using models with regularization like Ridge or Lasso. Also, scale your features so the regularization penalty treats them equally.
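Scaling and regularization can be combined in one step. Here is a minimal sketch, assuming the same synthetic data as earlier, that chains StandardScaler and Ridge with sklearn's make_pipeline so the features are standardized before the penalty is applied.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Same synthetic data as earlier
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1

# Standardize features first so the Ridge penalty shrinks them on equal terms
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print('Ridge coefficients (scaled features):', model.named_steps['ridge'].coef_)
```

Using a pipeline also prevents a common leak: the scaler is fit only on the training data passed to `fit`, so the same transformation is reused consistently at prediction time.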
Related Errors
Other issues related to multicollinearity include:
- Overfitting: Model fits noise due to redundant features.
- Unstable coefficients: Small data changes cause big coefficient swings.
- High standard errors: Coefficients become statistically insignificant.
Quick fixes include feature selection, dimensionality reduction (like PCA), or regularization.
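Of those quick fixes, PCA is worth a brief illustration: it replaces the correlated features with uncorrelated components. A minimal sketch, reusing the same synthetic data, that projects the two correlated features onto a single principal component before regressing:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Same synthetic data as earlier
np.random.seed(0)
x1 = np.random.rand(100)
x2 = x1 * 0.95 + np.random.rand(100) * 0.05
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + np.random.randn(100) * 0.1

# Standardize, then project onto one principal component;
# the components are uncorrelated by construction
pca_model = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pca_model.fit(X, y)
print('Explained variance ratio:',
      pca_model.named_steps['pca'].explained_variance_ratio_)
```

Because x1 and x2 are nearly identical, the first component captures almost all of the variance, so little information is lost by dropping the second. The trade-off is interpretability: coefficients now describe components, not the original features.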