Bias Variance Tradeoff in Python: Explanation and Example
The bias-variance tradeoff is the balance between a model's simplicity and its ability to fit the data well. In Python, using sklearn, it means choosing a model that is neither too simple (high bias) nor too complex (high variance), so it makes good predictions on new data.
How It Works
Imagine you are trying to guess the weight of apples based on their size. If you use a very simple rule, like always guessing the same weight, your guesses will be off in many cases. This is called high bias, where the model is too simple and misses important details.
On the other hand, if you try to remember every single apple's exact weight and size perfectly, your rule might work great on those apples but fail on new ones. This is called high variance, where the model is too complex and fits noise instead of the real pattern.
The bias-variance tradeoff is about finding the right middle ground so your model learns the true pattern without getting confused by random noise. In Python with sklearn, this means choosing the right model and settings to balance bias and variance for the best predictions.
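One quick way to observe the tradeoff is to sweep a single complexity setting and compare training and test error. The sketch below does this with a DecisionTreeRegressor's max_depth on synthetic noisy sine data (the data and depth values here are illustrative choices, not part of the original example):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy sine curve (illustrative)
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for depth in [1, 4, 20]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results[depth] = (train_mse, test_mse)
    print(f"depth={depth:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

A depth-1 stump shows high bias (both errors large), while a depth-20 tree nearly memorizes the training set: its training error drops toward zero while its test error stays noticeably higher, which is the signature of high variance.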
Example
This example shows bias and variance in action by training polynomial regression models of different degrees on noisy data using sklearn. A low degree produces high bias (underfitting); a high degree produces high variance (overfitting).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create sample data: a noisy sine curve
np.random.seed(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + np.random.normal(0, 0.3, x.shape)

# Split data into train and test
train_x, test_x = x[:20], x[20:]
train_y, test_y = y[:20], y[20:]

# Train a polynomial regression model and return it along with its errors
def train_and_evaluate(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(train_x.reshape(-1, 1), train_y)
    train_pred = model.predict(train_x.reshape(-1, 1))
    test_pred = model.predict(test_x.reshape(-1, 1))
    train_error = mean_squared_error(train_y, train_pred)
    test_error = mean_squared_error(test_y, test_pred)
    return model, train_error, test_error

# Degrees to test: low degree = high bias, high degree = high variance
degrees = [1, 3, 9]
plt.figure(figsize=(12, 4))
for i, degree in enumerate(degrees, 1):
    model, train_err, test_err = train_and_evaluate(degree)
    plt.subplot(1, 3, i)
    plt.scatter(train_x, train_y, color='blue', label='Train data')
    plt.scatter(test_x, test_y, color='green', label='Test data')
    line_x = np.linspace(0, 1, 100)
    plt.plot(line_x, model.predict(line_x.reshape(-1, 1)),
             color='red', label='Model prediction')
    plt.title(f'Degree {degree}\nTrain MSE: {train_err:.2f}\nTest MSE: {test_err:.2f}')
    plt.legend()
plt.tight_layout()
plt.show()
When to Use
Use the bias-variance tradeoff concept when building machine learning models to avoid underfitting or overfitting. It helps you pick the right model complexity and training approach.
For example, in real-world tasks like predicting house prices, customer behavior, or medical diagnoses, balancing bias and variance ensures your model works well on new, unseen data, not just the training data.
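In practice, the balance point is usually found with cross-validation rather than by eye. A minimal sketch, assuming the same kind of noisy sine data as in the example above, uses GridSearchCV to search over polynomial degrees ("polynomialfeatures__degree" is the parameter name auto-generated by make_pipeline):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# Noisy sine data, as in the example above
np.random.seed(0)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0]) + np.random.normal(0, 0.3, 30)

pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
grid = GridSearchCV(
    pipe,
    # "polynomialfeatures" is the step name make_pipeline derives from the class
    param_grid={"polynomialfeatures__degree": range(1, 10)},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print("best degree:", grid.best_params_["polynomialfeatures__degree"])
```

The degree with the best cross-validated error is the one that balances bias and variance for this dataset; degrees below it underfit and degrees above it overfit.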
Key Points
- Bias means error from too simple a model.
- Variance means error from too complex a model fitting noise.
- The tradeoff balances these to improve prediction accuracy.
- sklearn tools help test different model complexities.
- Good models generalize well to new data.
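The points above can be checked numerically: bias squared and variance can be estimated by training the same model on many freshly drawn noisy datasets and comparing the average prediction with the true function. A sketch under those assumptions (synthetic data, values illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x_grid = np.linspace(0, 1, 50).reshape(-1, 1)
true_y = np.sin(2 * np.pi * x_grid[:, 0])  # noise-free target on a fixed grid

def bias_variance(degree, n_datasets=200, n_points=30, noise=0.3):
    """Estimate bias^2 and variance of a polynomial fit across many datasets."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, size=(n_points, 1))
        y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, noise, n_points)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x, y)
        preds.append(model.predict(x_grid))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)  # avg prediction vs truth
    variance = np.mean(preds.var(axis=0))                  # spread across datasets
    return bias_sq, variance

results = {}
for degree in [1, 3, 9]:
    b, v = bias_variance(degree)
    results[degree] = (b, v)
    print(f"degree {degree}: bias^2={b:.3f}, variance={v:.3f}")
```

Degree 1 shows large bias squared with small variance, while degree 9 shows the reverse, matching the definitions in the key points.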