
Overfitting and underfitting in ML Python

Introduction

Overfitting and underfitting describe two common reasons a model fails on new data: it learned too much detail from the training data, or too little. Keep these concepts in mind in situations like the following:

When your model performs very well on training data but poorly on new data.
When your model performs poorly both on training and new data.
When you want to check if your model is too simple or too complex for the problem.
When tuning model settings to improve prediction accuracy.
When deciding how much data or features to use for training.
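The usual first diagnostic for all of these situations is the gap between training and test accuracy. A minimal sketch of that check (the threshold values here are illustrative assumptions, not standard cutoffs):

```python
def diagnose(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.70):
    """Rough heuristic: a large train/test gap suggests overfitting,
    low accuracy on both suggests underfitting. Thresholds are illustrative."""
    if train_acc < low_threshold and test_acc < low_threshold:
        return "possible underfitting"
    if train_acc - test_acc > gap_threshold:
        return "possible overfitting"
    return "reasonable fit"

print(diagnose(0.99, 0.72))  # large gap -> possible overfitting
print(diagnose(0.55, 0.53))  # low on both -> possible underfitting
print(diagnose(0.91, 0.88))  # small gap, decent scores -> reasonable fit
```

In practice you would feed this the `model.score(...)` values shown in the examples below rather than hard-coded numbers.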
Syntax
ML Python
No specific code syntax; these are concepts to check during model training and evaluation.

Overfitting means the model memorizes training data details and noise.

Underfitting means the model is too simple to capture the data patterns.
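To see what "memorizing" looks like in code, consider a 1-nearest-neighbor classifier: it stores the training set verbatim, so every training point is its own nearest neighbor and training accuracy is perfect by construction. A short scikit-learn sketch (not part of the original lesson):

```python
# A 1-nearest-neighbor classifier "memorizes" the training data:
# each training point is its own nearest neighbor (distance 0),
# so it predicts its own label and training accuracy is perfect.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
train_acc = model.score(X, y)
print(f"Training accuracy: {train_acc:.2f}")  # 1.00 -- pure memorization
```

Perfect training accuracy like this tells you nothing about how the model will do on new data, which is exactly the overfitting trap.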

Examples
A very deep decision tree can memorize training data, causing overfitting.
ML Python
# Overfitting example: an unrestricted tree can memorize the training set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=None)  # no depth limit: very deep tree
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)  # near-perfect on training data
test_score = model.score(X_test, y_test)     # typically lower on unseen data
A very shallow decision tree may be too simple, causing underfitting.
ML Python
# Underfitting example: a single split cannot separate all three iris classes
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=1)  # one split: very shallow tree
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)  # low on training data
test_score = model.score(X_test, y_test)     # low on test data too
Sample Program

This code trains two decision tree models on the iris dataset. One is very deep (likely overfitting), the other is very shallow (likely underfitting). It prints their accuracy on training and test data to show the difference.

ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Overfitting model: very deep tree
model_overfit = DecisionTreeClassifier(max_depth=None, random_state=42)
model_overfit.fit(X_train, y_train)
train_score_overfit = model_overfit.score(X_train, y_train)
test_score_overfit = model_overfit.score(X_test, y_test)

# Underfitting model: very shallow tree
model_underfit = DecisionTreeClassifier(max_depth=1, random_state=42)
model_underfit.fit(X_train, y_train)
train_score_underfit = model_underfit.score(X_train, y_train)
test_score_underfit = model_underfit.score(X_test, y_test)

print(f"Overfitting model - Train accuracy: {train_score_overfit:.2f}")
print(f"Overfitting model - Test accuracy: {test_score_overfit:.2f}")
print(f"Underfitting model - Train accuracy: {train_score_underfit:.2f}")
print(f"Underfitting model - Test accuracy: {test_score_underfit:.2f}")
Important Notes

Overfitting usually shows very high training accuracy but lower test accuracy.

Underfitting shows low accuracy on both training and test data.

Use techniques like cross-validation, pruning, or simpler models to avoid overfitting.
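As a sketch of the cross-validation suggestion above, scikit-learn's cross_val_score averages accuracy over several train/test splits, which gives a more reliable estimate than a single split (max_depth=3 here is just an illustrative pruning choice, not a recommended setting):

```python
# Cross-validation: evaluate a pruned tree on 5 different train/test splits
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pruned = DecisionTreeClassifier(max_depth=3, random_state=42)  # depth limit = simple pruning
scores = cross_val_score(pruned, X, y, cv=5)  # one accuracy per fold

print(f"Fold accuracies: {scores}")
print(f"Mean CV accuracy: {scores.mean():.2f}")
```

Because every sample serves as test data in exactly one fold, the mean score is less sensitive to a lucky or unlucky single split.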

Summary

Overfitting means the model learns too much noise and details from training data.

Underfitting means the model is too simple and misses important patterns.

Good models balance learning enough to predict well on new data without memorizing training data.