MLOps · How-To · Beginner · 4 min read

How to Detect Overfitting in Python with sklearn

To detect overfitting in Python, use sklearn to compare your model's performance on training data and held-out validation data. If the training score is much higher than the validation score, the model is likely overfitting: it has memorized the training data rather than learning patterns that generalize. Plotting learning curves with sklearn.model_selection.learning_curve also helps you spot overfitting.

📐 Syntax

Use train_test_split to split data, fit to train the model, and score to evaluate performance. Use learning_curve to plot training and validation scores over different training sizes.

  • train_test_split(X, y, test_size=0.2): splits data into training and testing sets.
  • model.fit(X_train, y_train): trains the model on training data.
  • model.score(X_test, y_test): evaluates model accuracy on test data.
  • learning_curve(model, X, y, cv=5): returns training-set sizes plus training and cross-validation scores for plotting.
python
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.linear_model import LogisticRegression

# Assume X, y are defined
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Get learning curve data
train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5)
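
A large, persistent gap between the training and validation scores as the training set grows is the classic signature of overfitting; the example below plots exactly this comparison.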

💻 Example

This example trains a logistic regression model on the iris dataset, compares training and test accuracy, and plots learning curves to detect overfitting.

python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, learning_curve
import numpy as np

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Scores
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"Training accuracy: {train_score:.2f}")
print(f"Test accuracy: {test_score:.2f}")

# Learning curve
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

train_mean = np.mean(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)

plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
plt.plot(train_sizes, val_mean, 'o-', color='green', label='Validation score')
plt.title('Learning Curve')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Output
Training accuracy: 0.98
Test accuracy: 0.98
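
In this run the training and test scores are nearly identical, so the model is not overfitting. For contrast, here is a minimal sketch (not part of the original example; exact numbers will vary) that deliberately overfits an unconstrained decision tree on noisy synthetic data, producing the large train/test gap that signals overfitting.

python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y=0.2 flips 20% of labels, so a perfect
# fit to the training set must be memorizing noise
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can memorize the training set completely
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print(f"Train accuracy: {tree.score(X_train, y_train):.2f}")  # typically ~1.00
print(f"Test accuracy: {tree.score(X_test, y_test):.2f}")     # noticeably lower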

⚠️ Common Pitfalls

Common mistakes when detecting overfitting include:

  • Only checking training accuracy without validation or test accuracy.
  • Using the same data for training and testing, which hides overfitting.
  • Ignoring learning curves that show a big gap between training and validation scores.

Always use separate validation or test sets and visualize learning curves to confirm overfitting.

python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Wrong: training and testing on the same data hides overfitting
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
model.fit(X, y)
print(f"Accuracy on same data: {model.score(X, y):.2f}")  # High but misleading

# Right: split the data so the test score reflects unseen samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)
print(f"Train accuracy: {model.score(X_train, y_train):.2f}")
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
Output
Accuracy on same data: 0.97
Train accuracy: 0.98
Test accuracy: 0.98
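
A single train/test split can be lucky or unlucky. For a more reliable check, average the score over several folds with cross-validation; this is a minimal sketch reusing the iris data from above.

python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the validation set,
# so the mean score is less sensitive to any one split
cv_scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(f"CV accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")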

📊 Quick Reference

Tips to detect overfitting in Python with sklearn:

  • Split your data into training and test sets using train_test_split.
  • Compare training and test accuracy; a large gap suggests overfitting.
  • Use learning_curve to plot training vs validation scores over increasing data sizes.
  • Consider cross-validation for more reliable validation scores.
  • Regularize your model or get more data if overfitting is detected; see the sketch below.
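
To illustrate the regularization tip, here is a minimal sketch (a toy on synthetic data, not a tuning recipe): in LogisticRegression, smaller C means stronger L2 regularization, which typically shrinks the train/test gap on noisy, high-dimensional data.

python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Noisy, high-dimensional data where a weakly regularized model can overfit
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Smaller C = stronger L2 regularization
for C in (100.0, 1.0, 0.01):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    gap = model.score(X_train, y_train) - model.score(X_test, y_test)
    print(f"C={C}: train/test gap = {gap:.2f}")  # gap usually narrows as C decreases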

Key Takeaways

  • Compare training and validation/test scores to spot overfitting.
  • Use learning curves to visualize model performance at different training-set sizes.
  • Always split data into separate training and test sets.
  • High training accuracy but low test accuracy signals overfitting.
  • Cross-validation gives more reliable estimates of validation performance.