Regularization discourages a model from memorizing its training data. It keeps the model simple so that it generalizes well to new data.
Why regularization controls overfitting in PyTorch
Introduction
Syntax
PyTorch
loss = criterion(output, target) + lambda_ * regularization_term
The regularization term adds a penalty to the loss.
Common regularizations are L1 (sum of absolute weights) and L2 (sum of squared weights).
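As a quick check of these two definitions, the sketch below computes both penalties for a small hand-picked weight tensor. The values are made up for illustration and do not come from any trained model.

```python
import torch

# Hypothetical weights chosen so the arithmetic is easy to follow.
weights = torch.tensor([1.0, -2.0, 3.0])

# L1 penalty: sum of absolute values -> |1| + |-2| + |3| = 6
l1_penalty = weights.abs().sum()

# L2 penalty: sum of squares -> 1 + 4 + 9 = 14
l2_penalty = weights.pow(2).sum()

print(l1_penalty.item(), l2_penalty.item())  # 6.0 14.0
```

In a real model the same expressions are applied to every parameter tensor and the results are added together.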
Examples
PyTorch
l2_lambda = 0.01
l2_norm = sum(p.pow(2.0).sum() for p in model.parameters())
loss = criterion(output, target) + l2_lambda * l2_norm
PyTorch
l1_lambda = 0.005
l1_norm = sum(p.abs().sum() for p in model.parameters())
loss = criterion(output, target) + l1_lambda * l1_norm
Sample Model
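This code trains a single-layer linear model on the four XOR points with an L2 penalty added to the loss. A single linear layer cannot represent XOR, so the example shows how the penalty is wired into a training loop rather than how to fit XOR well. It prints the final MSE loss and the rounded predictions.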
PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 1)

    def forward(self, x):
        return self.fc(x)

# Data: XOR problem
inputs = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
targets = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

model = SimpleNet()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
l2_lambda = 0.1

for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(inputs)
    mse_loss = criterion(outputs, targets)
    l2_norm = sum(p.pow(2).sum() for p in model.parameters())
    loss = mse_loss + l2_lambda * l2_norm
    loss.backward()
    optimizer.step()

# Print final loss and predictions
with torch.no_grad():
    preds = model(inputs)
    final_loss = criterion(preds, targets).item()

print(f"Final MSE Loss: {final_loss:.4f}")
print("Predictions:")
print(preds.round())
Important Notes
Regularization adds a small penalty to large weights, encouraging simpler models.
Too much regularization can make the model too simple and underfit.
Common regularization methods include L1, L2, and dropout.
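Dropout is listed above but not shown elsewhere in this lesson. The sketch below is a minimal illustration of how PyTorch's nn.Dropout behaves in training and evaluation mode. The input tensor and the seed are arbitrary choices made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # seed only so the printed mask is repeatable

dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is a no-op and the input passes through unchanged.
dropout.eval()
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

Because dropout only acts in training mode, remember to call model.eval() before measuring performance on validation or test data.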
Summary
Regularization helps control overfitting by keeping model weights small.
It adds a penalty term to the loss function during training.
This leads to better performance on new, unseen data.
Practice
1. Why does regularization help prevent overfitting in a PyTorch model?
easy
Solution
Step 1: Understand what overfitting means
Overfitting happens when a model learns the training data too well, including noise, causing poor performance on new data.
Step 2: Explain how regularization affects model weights
Regularization adds a penalty to large weights, encouraging smaller weights that generalize better to new data.
Final Answer:
It keeps the model weights small by adding a penalty to the loss. -> Option A
Quick Check:
Regularization = penalty on weights = less overfitting [OK]
Hint: Regularization adds penalty to weights to reduce overfitting [OK]
Common Mistakes:
- Thinking regularization increases data size
- Believing regularization removes layers
- Assuming regularization speeds training
2. Which PyTorch code snippet correctly applies L2 regularization (weight decay) during optimizer setup?
easy
Solution
Step 1: Identify correct parameter for L2 regularization in PyTorch
PyTorch uses weight_decay in optimizers to apply L2 regularization.
Step 2: Check the code options for correct usage
Only optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.1) uses weight_decay=0.1, which is the correct way to add L2 regularization.
Final Answer:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.1) -> Option C
Quick Check:
weight_decay = L2 regularization in PyTorch [OK]
Hint: Use weight_decay param for L2 regularization in PyTorch optimizers [OK]
Common Mistakes:
- Using dropout parameter in optimizer
- Confusing momentum with regularization
- Using decay instead of weight_decay
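To see why weight_decay counts as L2 regularization for SGD, the sketch below takes one optimizer step on two identical copies of a toy model: one uses weight_decay, the other adds the penalty to the loss by hand. The data, model size, and hyperparameters are placeholders chosen for this illustration. Note the factor of one half: SGD adds weight_decay * p to each gradient, which matches a loss penalty of (weight_decay / 2) * sum(p^2), so a hand-written penalty of lambda * sum(p^2) corresponds to weight_decay = 2 * lambda.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and two identical copies of a small linear model (illustrative only).
x = torch.randn(16, 3)
y = torch.randn(16, 1)
model_a = nn.Linear(3, 1)
model_b = copy.deepcopy(model_a)
criterion = nn.MSELoss()

lr, weight_decay = 0.1, 0.1

# Model A: let the optimizer apply the penalty through weight_decay.
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=weight_decay)
opt_a.zero_grad()
criterion(model_a(x), y).backward()
opt_a.step()

# Model B: add the penalty to the loss by hand.
# The gradient of (weight_decay / 2) * sum(p^2) is weight_decay * p,
# which is the term SGD adds to each gradient when weight_decay is set.
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
l2_norm = sum(p.pow(2).sum() for p in model_b.parameters())
loss_b = criterion(model_b(x), y) + (weight_decay / 2) * l2_norm
loss_b.backward()
opt_b.step()

# After one step the two models should agree up to floating-point error.
same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

This comparison is specific to plain SGD. Adaptive optimizers rescale the gradient, so the match between weight_decay and a loss penalty should not be assumed to carry over step for step.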
3. Consider this PyTorch training loop snippet with L2 regularization applied:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
What effect does the weight_decay=0.01 have during training?
medium
Solution
Step 1: Understand weight_decay in optimizer
The weight_decay parameter adds L2 regularization, penalizing large weights during training.
Step 2: Identify the effect on training
This penalty helps the model avoid overfitting by keeping weights smaller and more generalizable.
Final Answer:
It adds a penalty to large weights, helping reduce overfitting. -> Option A
Quick Check:
weight_decay = L2 penalty = less overfitting [OK]
Hint: weight_decay adds penalty to weights, not learning rate or dropout [OK]
Common Mistakes:
- Confusing weight_decay with learning rate changes
- Thinking weight_decay is dropout
- Assuming weight_decay controls early stopping
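The shrinking effect of weight_decay can be isolated with a single weight and a loss whose gradient is zero. The sketch below swaps in SGD instead of the Adam optimizer from the question, purely so the arithmetic stays simple. The weight value, learning rate, and decay strength are made up for this illustration.

```python
import torch

# A single hypothetical weight with a loss whose gradient is exactly zero,
# so any change in the weight comes from weight_decay alone.
w = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.SGD([w], lr=0.1, weight_decay=0.5)

optimizer.zero_grad()
loss = (0.0 * w).sum()  # constant loss: d(loss)/dw = 0
loss.backward()
optimizer.step()

# Update: w <- w - lr * (grad + weight_decay * w)
#           = 1 - 0.1 * (0 + 0.5 * 1), which is approximately 0.95
print(w.item())
```

Even with no signal from the data, the weight moves toward zero on every step, which is the pull that keeps weights small during training.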
4. You have this PyTorch code snippet intended to apply L2 regularization:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target) + 0.01 * torch.sum(model.parameters())
    loss.backward()
    optimizer.step()
What is wrong with this code regarding regularization?
medium
Solution
Step 1: Check how L2 regularization is computed
L2 regularization requires summing the squares of parameters, not just their values.
Step 2: Analyze the code's regularization term
The code sums parameters directly with torch.sum(model.parameters()), which is incorrect for an L2 penalty. As written, the call would also raise a TypeError, because model.parameters() returns a generator rather than a tensor.
Final Answer:
It incorrectly sums parameters instead of their squares for L2 penalty. -> Option D
Quick Check:
L2 penalty = sum of squares, not sum of values [OK]
Hint: L2 regularization sums squares of weights, not weights themselves [OK]
Common Mistakes:
- Summing parameters instead of squared parameters
- Thinking SGD can't use regularization
- Misplacing optimizer.zero_grad() call
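One way to repair the loop from this question is sketched below. The model, dataset, and l2_lambda value are placeholders invented so the snippet runs on its own. The only change to the training loop is the penalty line.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Placeholder model, data, and loss; substitute your own.
model = nn.Linear(4, 1)
dataset = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
dataloader = DataLoader(dataset, batch_size=8)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
l2_lambda = 0.01

for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    # Square each parameter tensor, sum it to a scalar, then add the scalars.
    l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
    loss = loss_fn(output, target) + l2_lambda * l2_penalty
    loss.backward()
    optimizer.step()

print(loss.item())
```

Iterating over model.parameters() and reducing each tensor separately avoids passing the generator to torch.sum, and squaring keeps every term of the penalty non-negative.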
5. You train two PyTorch models on the same dataset: Model A uses no regularization, Model B uses L2 regularization with weight_decay=0.05. After training, Model A has training accuracy 98% but test accuracy 70%, while Model B has training accuracy 90% and test accuracy 85%. What explains this difference?
hard
Solution
Step 1: Compare training and test accuracies
Model A fits training data very well but performs poorly on test data, indicating overfitting.
Step 2: Understand effect of L2 regularization on Model B
Model B has lower training accuracy but better test accuracy because regularization keeps weights smaller, improving generalization.
Final Answer:
Model B's regularization reduced overfitting by keeping weights smaller, improving test accuracy. -> Option B
Quick Check:
Regularization = smaller weights = better test accuracy [OK]
Hint: Better test accuracy with regularization means less overfitting [OK]
Common Mistakes:
- Assuming higher training accuracy means better test accuracy
- Confusing learning rate with regularization effect
- Ignoring the role of weight size in generalization
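The comparison in this question comes down to the gap between training and test accuracy. The short sketch below works through that arithmetic with the numbers quoted in the question. The helper name generalization_gap is a label chosen for this example, not a PyTorch function.

```python
# Accuracies quoted in the question, in percent.
model_a = {"train": 98, "test": 70}
model_b = {"train": 90, "test": 85}

def generalization_gap(acc):
    """Training accuracy minus test accuracy, in percentage points."""
    return acc["train"] - acc["test"]

gap_a = generalization_gap(model_a)
gap_b = generalization_gap(model_b)
print(gap_a, gap_b)  # 28 5
```

A gap of 28 points for Model A against 5 points for Model B is the signature of overfitting that the regularized model avoids.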
