
How to Fix Overfitting in an ML Model in Python with sklearn

Overfitting happens when a model learns the training data too well, including noise, causing poor performance on new data. To fix this in Python with sklearn, use techniques like train_test_split for validation, regularization (e.g., Ridge or Lasso), or early stopping with models that support it.
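For the early-stopping option, here is a minimal sketch using sklearn's SGDRegressor, which supports built-in early stopping (the synthetic make_regression data is an assumption for illustration, since no dataset is given yet):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustration only)
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
y = (y - y.mean()) / y.std()  # scale the target so SGD trains stably

# early_stopping=True holds out validation_fraction of the training data
# and stops once the validation score stops improving for n_iter_no_change epochs
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(early_stopping=True, validation_fraction=0.2,
                 n_iter_no_change=5, random_state=0),
)
model.fit(X, y)
```

Because training stops before the model can memorize noise, early stopping acts as an implicit regularizer.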
🔍

Why This Happens

Overfitting occurs when a model learns not only the true patterns but also the noise in the training data. This makes the model perform very well on training data but poorly on new, unseen data.

Here is an example of a training setup that makes overfitting impossible to detect: the model is fit and evaluated on the same data, so the reported error says nothing about performance on unseen data:

python
# load_boston was removed in scikit-learn 1.2; use the California housing dataset instead
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data
X, y = fetch_california_housing(return_X_y=True)

# Train on all data without splitting
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

mse = mean_squared_error(y, predictions)
print(f"Training MSE: {mse:.2f}")
Output
Training MSE: ~0.52
🔧

The Fix

To fix overfitting, split the data into training and testing sets so you can measure performance on unseen data, and use regularization to limit model complexity. Here, we combine train_test_split with Ridge regression, whose alpha parameter controls the regularization strength:

python
# load_boston was removed in scikit-learn 1.2; use the California housing dataset instead
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data
X, y = fetch_california_housing(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use Ridge regression with regularization
model = Ridge(alpha=1.0)  # alpha controls regularization strength
model.fit(X_train, y_train)

# Predict on train and test
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

train_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)

print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE: {test_mse:.2f}")
Output
Train MSE: ~0.52
Test MSE: ~0.53
🛡️

Prevention

To avoid overfitting in the future, always split your data into training and testing sets. Use cross-validation to check model stability across different data splits. Apply regularization methods like Ridge or Lasso to keep models simple. Also, consider reducing the number of features or using a simpler model if the dataset is small.
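The cross-validation advice can be sketched as follows (again on synthetic data, as an illustration rather than a recipe for any particular dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (illustration only)
X, y = make_regression(n_samples=300, n_features=15, noise=5.0, random_state=42)

# 5-fold cross-validation: each fold serves once as the validation set,
# so the model is always scored on data it was not trained on
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print(f"Mean CV MSE: {-scores.mean():.2f} (+/- {scores.std():.2f})")
```

A large spread between folds, or a mean CV error far above the training error, is a sign the model does not generalize.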

Regularly monitor training and validation errors to detect overfitting early.

⚠️

Related Errors

Similar issues include underfitting, where the model is too simple and performs poorly on both training and test data. To fix underfitting, increase model complexity or add features.
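As a minimal sketch of one underfitting fix, adding polynomial features lets a linear model capture a curved relationship (the quadratic synthetic data here is an assumption chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data that a plain straight line underfits
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(f"Linear R^2: {linear.score(X, y):.2f}")     # low: too simple for the data
print(f"Quadratic R^2: {poly.score(X, y):.2f}")    # high: complexity matches the data
```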

Another related problem is data leakage, where information from test data leaks into training, causing misleadingly good results.
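One common way to avoid this kind of leakage in sklearn is a Pipeline, which fits preprocessing steps on the training data only (the scaling step below is a hypothetical example of such preprocessing):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Leaky: fitting the scaler on all of X would let test-set statistics
# influence training. The pipeline instead fits the scaler on X_train only.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)
print(f"Test R^2: {pipe.score(X_test, y_test):.2f}")
```

The same pipeline can be passed to cross_val_score, where it refits the scaler inside each fold, keeping every validation fold untouched.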

Key Takeaways

Always split data into training and testing sets to detect overfitting.
Use regularization like Ridge or Lasso to reduce model complexity.
Monitor training and validation errors to catch overfitting early.
Simplify models or reduce features if overfitting persists.
Beware of data leakage that can hide overfitting problems.