How to Fix Overfitting in ML Model in Python with sklearn
To fix overfitting in scikit-learn, use techniques such as train_test_split for validation, regularization (e.g., Ridge or Lasso), or early stopping with models that support it.
Why This Happens
Overfitting occurs when a model learns not only the true patterns but also the noise in the training data. This makes the model perform very well on training data but poorly on new, unseen data.
Here is an example of a model that appears to perform well only because it is trained and evaluated on the same data, with no validation:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data (load_boston was removed in scikit-learn 1.2;
# load_diabetes is used here instead)
X, y = load_diabetes(return_X_y=True)

# Train on all data without splitting -- the reported error
# says nothing about performance on unseen data
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
mse = mean_squared_error(y, predictions)
print(f"Training MSE: {mse:.2f}")
```
The Fix
To fix overfitting, split the data into training and test sets so performance can be checked on unseen data, and use regularization to limit model complexity. Here, we combine train_test_split with Ridge regression:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data (load_diabetes replaces the removed load_boston)
X, y = load_diabetes(return_X_y=True)

# Split data: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Ridge regression; alpha controls regularization strength
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# Compare errors on training vs. held-out data
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
train_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)
print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE: {test_mse:.2f}")
```
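The early stopping mentioned earlier is also worth showing. As a minimal sketch, assuming SGDRegressor as the model (scikit-learn's gradient-boosting and SGD estimators support early stopping; the specific parameter values here are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# SGDRegressor is sensitive to feature scale, so standardize first
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# early_stopping=True holds out validation_fraction of the training
# data and stops once the validation score stops improving
model = SGDRegressor(
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=5,
    random_state=42,
)
model.fit(X_train_s, y_train)
print(f"Iterations run: {model.n_iter_}")
print(f"Test R^2: {model.score(X_test_s, y_test):.2f}")
```

Training halts before the iteration budget is exhausted when the held-out validation score plateaus, which limits how closely the model can fit noise.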
Prevention
To avoid overfitting in the future, always split your data into training and test sets, use cross-validation to check model stability, and apply regularization methods like Ridge or Lasso to keep models simple. If the dataset is small, consider reducing the number of features or using a simpler model.
Regularly monitor training and validation errors to detect overfitting early.
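The cross-validation check above can be sketched as follows (using load_diabetes as a stand-in dataset):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: each fold is held out once for scoring
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"R^2 per fold: {scores.round(2)}")
print(f"Mean R^2: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

If the fold scores vary wildly, or the mean cross-validation score is far below the training score, the model is likely overfitting.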
Related Errors
Similar issues include underfitting, where the model is too simple and performs poorly on both training and test data. To fix underfitting, increase model complexity or add features.
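As a small illustration of that underfitting fix, here is a sketch on synthetic quadratic data (the dataset and degree are made up for the example): a plain linear model underfits, while adding polynomial features restores the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy quadratic data: a straight line cannot capture y = x^2
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# Underfitting model vs. one with added polynomial features
linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(f"Linear R^2: {linear.score(X, y):.2f}")
print(f"Quadratic R^2: {poly.score(X, y):.2f}")
```

The linear model scores near zero while the quadratic pipeline fits almost perfectly, which is the signature of underfitting cured by added capacity.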
Another related problem is data leakage, where information from test data leaks into training, causing misleadingly good results.
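One common source of such leakage is fitting a preprocessing step (like a scaler) on the full dataset before splitting. A minimal sketch of the safe pattern, using a Pipeline so the scaler only ever sees training data:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Leaky version (avoid): a scaler fitted on ALL of X absorbs
# test-set statistics before the split is respected
# scaler = StandardScaler().fit(X)

# Safe version: the pipeline fits the scaler on X_train only,
# then applies the same transform to X_test at predict time
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(f"Test R^2: {model.score(X_test, y_test):.2f}")
```

Keeping every fitted step inside the pipeline guarantees test data never influences training, so the reported score is honest.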