MLOps Concept · Beginner · 3 min read

Bias-Variance Tradeoff in Python: Explanation and Example

The bias-variance tradeoff is the balance between a model's simplicity and its ability to fit data well. In Python with sklearn, this means choosing a model that is neither too simple (high bias) nor too complex (high variance), so it makes good predictions on new data.
⚙️

How It Works

Imagine you are trying to guess the weight of apples based on their size. If you use a very simple rule, like always guessing the same weight, your guesses will be off in many cases. This is called high bias, where the model is too simple and misses important details.

On the other hand, if you try to remember every single apple's exact weight and size perfectly, your rule might work great on those apples but fail on new ones. This is called high variance, where the model is too complex and fits noise instead of the real pattern.

The bias-variance tradeoff is about finding the right middle ground so your model learns the true pattern without getting confused by random noise. In Python with sklearn, this means choosing the right model and settings to balance bias and variance for the best predictions.
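The apple analogy can be made concrete with a small sketch. Here, a DummyRegressor stands in for the "always guess the same weight" rule (high bias) and a 1-nearest-neighbor regressor for the "memorize every apple" rule (high variance); the apple data itself is made up for illustration:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical apples: weight grows with size, plus random noise
size = rng.uniform(5, 10, 40).reshape(-1, 1)        # diameter in cm
weight = 20 * size.ravel() + rng.normal(0, 10, 40)  # grams

train_X, test_X = size[:30], size[30:]
train_y, test_y = weight[:30], weight[30:]

# High bias: always predict the mean training weight, ignoring size
simple = DummyRegressor(strategy='mean').fit(train_X, train_y)
# High variance: memorize every training apple exactly (1-nearest neighbor)
complex_model = KNeighborsRegressor(n_neighbors=1).fit(train_X, train_y)

for name, m in [('high bias', simple), ('high variance', complex_model)]:
    train_mse = mean_squared_error(train_y, m.predict(train_X))
    test_mse = mean_squared_error(test_y, m.predict(test_X))
    print(f'{name}: train MSE {train_mse:.1f}, test MSE {test_mse:.1f}')
```

The high-bias model is wrong on training and test apples alike; the high-variance model scores a perfect zero error on the apples it memorized but still misses on new ones.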

💻

Example

This example shows how bias and variance appear when you train polynomial regression models of different degrees on noisy data with sklearn. A low degree gives high bias; a high degree gives high variance.

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create sample data
np.random.seed(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + np.random.normal(0, 0.3, x.shape)

# Split data into train and test
train_x, test_x = x[:20], x[20:]
train_y, test_y = y[:20], y[20:]

# Function to train and evaluate polynomial regression

def train_and_evaluate(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(train_x.reshape(-1, 1), train_y)
    train_error = mean_squared_error(train_y, model.predict(train_x.reshape(-1, 1)))
    test_error = mean_squared_error(test_y, model.predict(test_x.reshape(-1, 1)))
    return model, train_error, test_error

# Degrees to test
degrees = [1, 3, 9]

plt.figure(figsize=(12, 4))
for i, degree in enumerate(degrees, 1):
    model, train_err, test_err = train_and_evaluate(degree)
    plt.subplot(1, 3, i)
    plt.scatter(train_x, train_y, color='blue', label='Train data')
    plt.scatter(test_x, test_y, color='green', label='Test data')
    grid = np.linspace(0, 1, 100).reshape(-1, 1)
    plt.plot(grid, model.predict(grid), color='red', label='Model prediction')
    plt.title(f'Degree {degree}\nTrain MSE: {train_err:.2f}\nTest MSE: {test_err:.2f}')
    plt.legend()
plt.tight_layout()
plt.show()
Output
A plot with 3 subplots showing polynomial fits of degree 1, 3, and 9. Degree 1 shows underfitting (high bias), degree 9 shows overfitting (high variance), and degree 3 balances both.
🎯

When to Use

Use the bias-variance tradeoff concept when building machine learning models to avoid underfitting or overfitting. It helps you pick the right model complexity and training approach.

For example, in real-world tasks like predicting house prices, customer behavior, or medical diagnoses, balancing bias and variance ensures your model works well on new, unseen data, not just the training data.
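Rather than eyeballing plots, you can let sklearn search over model complexity for you with `validation_curve` (or `GridSearchCV`). A minimal sketch, assuming synthetic sine data of the same kind as the example above:

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.3, 50)

model = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = list(range(1, 10))

# Cross-validated scores for each polynomial degree
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name='polynomialfeatures__degree',
    param_range=degrees,
    scoring='neg_mean_squared_error', cv=5)

# Pick the degree with the best average validation score
best = degrees[np.argmax(val_scores.mean(axis=1))]
print('Best degree:', best)
```

Cross-validation picks the complexity that generalizes best, which is exactly the bias-variance balance point: low degrees score poorly on both training and validation folds (bias), very high degrees score well on training folds but poorly on validation folds (variance).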

Key Points

  • Bias is error from a model that is too simple to capture the pattern.
  • Variance is error from a model so complex that it fits noise.
  • The tradeoff balances these to improve prediction accuracy.
  • sklearn tools help test different model complexities.
  • Good models generalize well to new data.

Key Takeaways

  • The bias-variance tradeoff balances model simplicity and complexity to improve predictions.
  • High bias causes underfitting; high variance causes overfitting.
  • Use sklearn to experiment with model complexity and evaluate errors.
  • Good models generalize well to new data, not just training data.
  • Understanding this tradeoff helps build reliable machine learning models.