
Overfitting and underfitting in ML Python - Deep Dive

Overview - Overfitting and underfitting
What is it?
Overfitting and underfitting describe how well a machine learning model learns from data. Overfitting happens when a model learns too much detail, including noise, so it performs poorly on new data. Underfitting happens when a model learns too little and misses important patterns. Both cause poor predictions on data the model hasn't seen before.
Why it matters
These problems matter because they affect how useful a model is in real life. If a model overfits, it looks perfect on training data but fails in practice. If it underfits, it never learns enough to be helpful. Without understanding these, models would be unreliable, wasting time and resources and possibly causing wrong decisions.
Where it fits
Before learning this, you should know basic machine learning concepts like training data, models, and predictions. After this, you can learn about techniques to fix these problems, like regularization, cross-validation, and model selection.
Mental Model
Core Idea
A good model balances learning enough from data to predict well without memorizing noise or missing key patterns.
Think of it like...
It's like studying for a test: overfitting is memorizing every example question without understanding, so you fail new questions; underfitting is not studying enough, so you don't know the material well.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Underfitting  │──────▶│   Good Fit    │──────▶│  Overfitting  │
│ (too simple)  │       │  (balanced)   │       │ (too complex) │
└───────────────┘       └───────────────┘       └───────────────┘
        ▲                       ▲                       ▲
        │                       │                       │
  High error on           Low error on            Low on train,
  train & test            train & test            high on test
Build-Up - 7 Steps
1
Foundation: What is model fitting in ML
Concept: Understanding what it means for a model to fit data.
In machine learning, fitting means the model learns patterns from data. The model looks at input data and tries to find rules to predict outputs. The better it fits, the closer its predictions match the real answers on the training data.
Result
You get a model that can predict outputs for inputs it has seen.
Understanding fitting is the base for knowing why too much or too little fitting causes problems.
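To make the idea concrete, here is a minimal pure-Python sketch of fitting: a closed-form least-squares fit of a straight line y = w·x + b. The data values are illustrative.

```python
# Minimal sketch (pure Python): "fitting" a straight line y = w*x + b
# with closed-form least squares. Data values are illustrative.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope that minimizes squared error on the training data
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.0, 8.1]  # roughly y = 2x
w, b = fit_line(xs, ys)
print(round(w, 2), round(b, 2))  # slope near 2, intercept near 0
```

Fitting here just means choosing w and b so the line's predictions sit as close as possible to the training targets.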
2
Foundation: Difference between training and test data
Concept: Introducing the idea of testing model performance on new data.
Training data is what the model learns from. Test data is new data the model hasn't seen. We check how well the model predicts test data to see if it learned general rules or just memorized training examples.
Result
You can measure if a model will work well in real situations.
Knowing the difference helps spot when a model is overfitting or underfitting.
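A minimal sketch of the split itself in pure Python; the 25% test fraction and fixed seed are illustrative choices:

```python
import random

# Minimal sketch of a train/test split: hold out part of the data so we can
# check generalization. The 25% test fraction and seed are illustrative.
def train_test_split(data, test_fraction=0.25, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

data = list(range(20))
train, test = train_test_split(data)
print(len(train), len(test))  # 15 5
```

The model only ever sees the train portion during fitting; the test portion stands in for future, unseen data.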
3
Intermediate: What causes underfitting in models
🤔 Before reading on: do you think underfitting happens when a model is too simple or too complex? Commit to your answer.
Concept: Underfitting happens when a model is too simple to capture data patterns.
If a model has too few parameters or uses a simple method, it can't learn the true relationships in data. For example, fitting a straight line to data that curves will miss important trends.
Result
The model has high errors on both training and test data.
Understanding underfitting shows why model complexity must match data complexity.
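A small pure-Python illustration of the straight-line-on-a-curve example: the best possible line through quadratic data turns out to be flat, and its error stays high even on the training set.

```python
# Underfitting sketch: the best straight line through curved (quadratic)
# data is flat, and its error stays high even on the training set.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # true relationship is a curve

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
# closed-form least-squares slope and intercept
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - w * mean_x

train_mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w, b, train_mse)  # slope 0: the line cannot follow the curve
```

No choice of slope and intercept can fix this; the model family itself is too simple for the data.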
4
Intermediate: What causes overfitting in models
🤔 Before reading on: do you think overfitting means the model performs better or worse on training data compared to test data? Commit to your answer.
Concept: Overfitting happens when a model learns noise and details that don't generalize.
A very complex model with many parameters can memorize training data, including random noise. This makes it look perfect on training data but fail on new data because noise is not repeated.
Result
The model has low error on training data but high error on test data.
Knowing overfitting explains why more complexity is not always better.
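A sketch of memorization taken to the extreme: a polynomial passed exactly through every noisy training point (Lagrange interpolation) has zero training error but wild predictions on unseen inputs. The data points are illustrative, roughly following y = x.

```python
# Overfitting taken to the extreme: a degree-(n-1) polynomial through every
# noisy point (Lagrange interpolation) has zero training error but wild
# predictions on unseen inputs. Data roughly follow y = x (illustrative).
def interpolate(xs, ys, x):
    # evaluate the unique polynomial through the points (xs, ys) at x
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [0, 1, 2, 3, 4, 5]
ys = [0.0, 1.1, 1.9, 3.2, 3.8, 5.1]  # y = x plus small noise

train_error = max(abs(interpolate(xs, ys, x) - y) for x, y in zip(xs, ys))
test_pred = interpolate(xs, ys, 6)  # unseen input; linear trend suggests ~6
print(train_error, test_pred)
```

The training error is exactly zero, yet the prediction at the unseen input x = 6 lands far from the underlying trend, because the curve bent itself around the noise.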
5
Intermediate: Measuring fit with error metrics
🤔 Before reading on: do you think a lower training error always means a better model? Commit to your answer.
Concept: Using error numbers to see how well a model fits training and test data.
Common metrics like mean squared error or accuracy show how close predictions are to true values. Comparing training and test errors helps detect underfitting (both high) or overfitting (training low, test high).
Result
You can quantify and compare model performance clearly.
Understanding metrics helps diagnose fitting problems objectively.
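A sketch of this diagnosis as code. The fixed `tolerance` threshold is an illustrative assumption; in practice you compare errors against a baseline for your problem rather than a single cutoff.

```python
# Sketch: diagnosing fit by comparing train and test error. The fixed
# tolerance is illustrative; real work compares against a baseline.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def diagnose(train_err, test_err, tolerance=0.5):
    if train_err > tolerance and test_err > tolerance:
        return "underfitting"  # high error everywhere
    if train_err <= tolerance < test_err:
        return "overfitting"  # memorized the training data
    return "reasonable fit"

print(diagnose(train_err=2.8, test_err=3.0))
print(diagnose(train_err=0.0, test_err=4.2))
print(diagnose(train_err=0.3, test_err=0.4))
```

The pattern to remember: both errors high means underfitting; a large gap between low training error and high test error means overfitting.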
6
Advanced: Balancing the bias-variance tradeoff
🤔 Before reading on: do you think bias means error from wrong assumptions or from noise? Commit to your answer.
Concept: Bias and variance explain why models underfit or overfit.
Bias is error from wrong assumptions or too simple models (underfitting). Variance is error from sensitivity to training data noise (overfitting). Good models balance bias and variance to minimize total error.
Result
You understand the fundamental tradeoff behind fitting problems.
Knowing bias-variance tradeoff guides choosing model complexity and training methods.
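The tradeoff can be estimated numerically by refitting a model on many noisy training sets and examining its predictions at a single input. This sketch uses a deliberately high-bias model (predict the mean target, ignoring x); the true function f(x) = 2x and unit noise are assumptions for illustration.

```python
import random

# Sketch: estimate bias and variance at one input x0 by refitting on many
# noisy training sets. True function f(x) = 2x and unit noise are assumptions.
random.seed(0)

def f(x):
    return 2 * x

x0, true_y = 3.0, f(3.0)

def fit_mean_model(xs, ys):
    # deliberately high-bias model: ignore x, always predict the mean target
    m = sum(ys) / len(ys)
    return lambda x: m

preds = []
for _ in range(500):
    xs = [0, 1, 2, 3, 4]
    ys = [f(x) + random.gauss(0, 1) for x in xs]  # fresh noisy training set
    preds.append(fit_mean_model(xs, ys)(x0))

mean_pred = sum(preds) / len(preds)
bias_sq = (mean_pred - true_y) ** 2  # error from the wrong assumption
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
print(round(bias_sq, 2), round(variance, 2))  # high bias, low variance
```

Because this model is too simple, its squared bias is large while its variance is small; a memorizing model would show the opposite pattern.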
7
Expert: Why overfitting happens in high dimensions
🤔 Before reading on: do you think more features always help a model generalize better? Commit to your answer.
Concept: High-dimensional data increases risk of overfitting due to many ways to fit noise.
When data has many features, models can find complex patterns that fit training data perfectly but don't hold for new data. This is called the 'curse of dimensionality'. Regularization and feature selection help control this.
Result
You see why more data features can hurt model generalization without care.
Understanding high-dimensional overfitting explains why data preprocessing is crucial in real projects.
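A one-feature sketch of how an L2 (ridge) penalty shrinks a fitted coefficient, which is one of the tools mentioned above for controlling this risk. It uses the no-intercept closed form w = Σxy / (Σx² + λ), valid here because the inputs are centered; the data are illustrative.

```python
# Sketch: L2 (ridge) regularization shrinks the fitted coefficient, trading
# a little bias for less sensitivity to noise. One-feature, no-intercept
# closed form, valid here because the inputs are centered; data illustrative.
def ridge_slope(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [-2, -1, 0, 1, 2]
ys = [-4.2, -1.8, 0.3, 2.1, 3.9]  # roughly y = 2x with noise

for lam in [0.0, 1.0, 10.0]:
    # larger lam pulls the slope toward zero
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

With many features, the same penalty is applied to every coefficient, discouraging the model from using spurious dimensions just to fit noise.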
Under the Hood
Models learn by adjusting parameters to reduce error on training data. Overfitting occurs when parameters fit noise, causing complex decision boundaries or functions. Underfitting happens when parameters are too limited to capture true data patterns. The balance depends on model capacity, data size, and noise level.
Why is it designed this way?
Machine learning models were designed to find patterns in data, but early methods lacked ways to control complexity, causing overfitting. Techniques like regularization and validation sets were introduced to balance learning and generalization. This design reflects the need to handle real-world noisy data effectively.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│   Data Input  │──────▶│ Model Training│──────▶│ Parameter Fit │
└───────────────┘       └───────────────┘       └───────────────┘
                                │                      │
                                ▼                      ▼
                       ┌───────────────┐      ┌────────────────┐
                       │  Model Output │◀─────│  Error Measure │
                       └───────────────┘      └────────────────┘
                                ▲                      │
                                └──────────────┬───────┘
                                               ▼
                                      Adjust Parameters
                                      (fit better or worse)
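The loop above, written out as code: a one-parameter model y = w·x trained by gradient descent, where each pass measures the error and adjusts the parameter to reduce it.

```python
# The diagram's loop as code: a one-parameter model y = w*x trained by
# gradient descent; each pass measures error and adjusts the parameter.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]  # true relationship: y = 2x

w = 0.0    # initial parameter
lr = 0.01  # learning rate (step size for each adjustment)
for _ in range(200):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # adjust the parameter to fit better
print(round(w, 3))  # converges to 2.0
```

Overfitting and underfitting both live inside this loop: too much capacity (or too many passes) lets the parameters chase noise, while too little capacity (or too few passes) leaves patterns unlearned.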
Myth Busters - 4 Common Misconceptions
Quick: Does a model with zero training error always perform best on new data? Commit yes or no.
Common Belief: If a model fits training data perfectly, it must be the best model.
Reality: Perfect training fit usually means overfitting, causing poor performance on new data.
Why it matters: Believing this leads to trusting models that fail in real-world use, wasting resources.
Quick: Is underfitting only a problem for very simple models? Commit yes or no.
Common Belief: Underfitting only happens if the model is too simple, like linear regression on complex data.
Reality: Underfitting can also happen if training is stopped too early or features are missing, even in complex models.
Why it matters: Ignoring this can cause missed opportunities to improve models by better training or data.
Quick: Does adding more data always fix overfitting? Commit yes or no.
Common Belief: More data always solves overfitting problems.
Reality: More data helps but doesn't guarantee fixing overfitting if model complexity is too high or noise is large.
Why it matters: Relying only on more data can waste effort without addressing model design issues.
Quick: Can a model be both overfitting and underfitting at the same time? Commit yes or no.
Common Belief: A model cannot be overfitting and underfitting simultaneously.
Reality: A model can underfit some parts of data and overfit others, especially in complex or imbalanced datasets.
Why it matters: Recognizing this helps design better models that handle different data regions properly.
Expert Zone
1
Regularization strength must be carefully tuned; too strong causes underfitting, too weak allows overfitting.
2
Early stopping during training is a practical way to prevent overfitting by monitoring validation error.
3
Data augmentation can reduce overfitting by increasing effective training data diversity without new samples.
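A sketch of the early-stopping rule from point 2, assuming a list of per-epoch validation errors and an illustrative `patience` parameter:

```python
# Sketch of early stopping: halt once validation error has failed to
# improve for `patience` checks. Error values are illustrative.
def early_stop_epoch(val_errors, patience=2):
    best = float("inf")
    bad_streak = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, bad_streak = err, 0
        else:
            bad_streak += 1
            if bad_streak >= patience:
                return epoch  # stop: overfitting has likely set in
    return len(val_errors) - 1

# validation error falls, then climbs as the model starts to overfit
errors = [0.9, 0.6, 0.4, 0.35, 0.37, 0.40, 0.45]
print(early_stop_epoch(errors))  # 5
```

Stopping at that point keeps the parameters from the region where validation error was still improving, before the model began fitting noise.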
When NOT to use
Avoid relying solely on complex models to fix underfitting; sometimes better features or simpler models with proper tuning work best. For overfitting, alternatives include simpler models, pruning, or Bayesian methods instead of just regularization.
Production Patterns
In real systems, cross-validation is used to detect overfitting early. Pipelines include feature selection, regularization, and monitoring test error continuously. Ensembles combine models to reduce overfitting risk.
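A minimal sketch of how k-fold cross-validation partitions the data: every example lands in a test fold exactly once, so the averaged test error is a more reliable generalization estimate. This is index bookkeeping only; the model-fitting step is omitted.

```python
# Sketch: k-fold cross-validation index bookkeeping. Every example is used
# for testing exactly once; the model-fitting step itself is omitted.
def k_fold_indices(n, k):
    folds = []
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n  # last fold takes the rest
        test = list(range(start, end))
        train = [j for j in range(n) if j < start or j >= end]
        folds.append((train, test))
    return folds

folds = k_fold_indices(10, 5)
for train_idx, test_idx in folds:
    print(test_idx)  # each index appears in exactly one test fold
```

In a real pipeline you would fit a fresh model on each train split, score it on the matching test split, and average the k scores.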
Connections
Bias-Variance Tradeoff
Overfitting and underfitting are practical outcomes of the bias-variance tradeoff.
Understanding bias and variance helps explain why models fail to generalize and guides model complexity choices.
Human Learning and Memory
Similar to how humans learn: overfitting is like rote memorization, while underfitting is like not paying attention.
This connection shows that learning well means balancing memorization and understanding, a universal principle.
Signal Processing Noise Filtering
Overfitting is like mistaking noise for signal; underfitting is like filtering out important signals.
Knowing noise filtering helps understand how models must separate true patterns from random fluctuations.
Common Pitfalls
#1 Ignoring test data performance and trusting only training accuracy.
Wrong approach:
    model.fit(X_train, y_train)
    print('Training accuracy:', model.score(X_train, y_train))
    # No test evaluation
Correct approach:
    model.fit(X_train, y_train)
    print('Training accuracy:', model.score(X_train, y_train))
    print('Test accuracy:', model.score(X_test, y_test))
Root cause: Misunderstanding that training accuracy alone does not reflect real-world performance.
#2 Using an overly complex model on small data without regularization.
Wrong approach:
    model = ComplexModel()
    model.fit(small_data, labels)
    # No regularization or validation
Correct approach:
    model = ComplexModel(regularization=0.1)
    model.fit(small_data, labels)
    # Use validation to monitor overfitting
Root cause: Not recognizing that model complexity must match data size and quality.
#3 Stopping training too early, causing underfitting.
Wrong approach:
    for epoch in range(1):  # Only one epoch
        model.train_one_epoch()
Correct approach:
    for epoch in range(50):  # Train enough epochs
        model.train_one_epoch()
        if validation_error_increases():
            break  # Early stopping
Root cause: Misunderstanding the training time needed to learn patterns fully.
Key Takeaways
Overfitting means a model learns too much noise, hurting new data predictions; underfitting means it learns too little, missing patterns.
Balancing model complexity and data quality is key to good generalization.
Measuring errors on both training and test data reveals fitting problems clearly.
The bias-variance tradeoff explains why models fail to generalize and guides tuning.
Real-world solutions include regularization, validation, early stopping, and data augmentation.