Weight decay helps prevent overfitting by keeping model weights small. The key metrics to watch are validation loss and validation accuracy. If weight decay works well, validation loss should decrease or stay stable while training loss might be higher. This means the model generalizes better to new data.
Weight decay (L2 regularization) in PyTorch - Model Metrics & Evaluation
Weight decay itself does not change the confusion matrix directly. But by reducing overfitting, it helps improve the confusion matrix on validation data. For example, a confusion matrix might look like this:
                     Actual positive   Actual negative
Predicted positive   TP=45             FP=5
Predicted negative   FN=10             TN=40
Total samples = 100
Here, TP = true positives, FP = false positives, FN = false negatives, TN = true negatives. Weight decay helps improve these numbers by making the model less sensitive to noise.
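As a rough illustration, the usual metrics can be derived from the four counts in the example confusion matrix above. The variable names are only illustrative; the counts are the ones given in the text.

```python
# Counts taken from the example confusion matrix above.
tp, fp, fn, tn = 45, 5, 10, 40

total = tp + fp + fn + tn
accuracy = (tp + tn) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # share of predicted positives that are truly positive
recall = tp / (tp + fn)         # share of true positives that the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Comparing these values between a run with weight decay and a run without it, on the same validation set, is one way to see whether the regularization is helping.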
Weight decay reduces overfitting, which can improve both precision and recall on new data. But if weight decay is too strong, the model may underfit, lowering both precision and recall.
Example:
- Without weight decay: Precision=0.7, Recall=0.6 (overfitting, unstable)
- With moderate weight decay: Precision=0.75, Recall=0.7 (better generalization)
- With too much weight decay: Precision=0.6, Recall=0.5 (underfitting)
Good: Validation loss close to training loss, stable or improving validation accuracy, balanced precision and recall.
Bad: Validation loss much higher than training loss (overfitting), or both losses high (underfitting). Precision or recall very low, showing poor generalization.
- Ignoring validation metrics and only looking at training loss can hide overfitting.
- Setting weight decay too high can cause underfitting, making the model too simple.
- Data leakage can falsely improve validation metrics, hiding real overfitting.
- Confusing weight decay with dropout: weight decay penalizes large weights, while dropout randomly disables units during training.
A model with high accuracy but only 12% recall on fraud cases is not good. High accuracy can be misleading when the data is imbalanced: a 12% recall means the model misses 88% of fraud cases, which is dangerous. Weight decay might help the model generalize better, but recall on the fraud class still needs to improve before the model is useful for fraud detection.
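A small sketch of why accuracy can look excellent while recall is poor. All counts here are hypothetical and chosen only to reproduce a 12% recall on an imbalanced dataset; they are not results from any real model.

```python
# Hypothetical counts for illustration only: 10,000 transactions, 100 of them fraud.
n_total = 10_000
n_fraud = 100

tp = 12                        # fraud cases caught, giving 12% recall
fn = n_fraud - tp              # fraud cases missed
fp = 0                         # assume no false alarms, to keep the arithmetic simple
tn = n_total - n_fraud - fp    # legitimate transactions correctly passed

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} recall={recall:.2f} missed_fraud={fn}")
```

Under these assumed counts the model is right on more than 99% of transactions while still missing 88 of the 100 fraud cases.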
Practice
Solution
Step 1: Understand weight decay concept
Weight decay adds a penalty to large weights during training to prevent the model from fitting noise in the data.
Step 2: Connect to overfitting reduction
By keeping weights small, the model generalizes better and avoids overfitting.
Final Answer:
To reduce overfitting by penalizing large weights -> Option A
Quick Check:
Weight decay = reduces overfitting [OK]
- Confusing weight decay with learning rate changes
- Thinking weight decay adds layers
- Assuming weight decay speeds training
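The "penalty on large weights" can be checked numerically. The sketch below assumes plain SGD without momentum, a toy linear layer, and illustrative values for `lr` and `wd`. It compares one step using the optimizer's `weight_decay` argument with one step where an explicit penalty of (wd / 2) times the sum of squared parameters is added to the loss.

```python
import torch

torch.manual_seed(0)
lr, wd = 0.1, 0.01                      # illustrative values
x = torch.tensor([[1.0, 2.0]])

# Two identical layers, so both runs start from the same parameters.
model_a = torch.nn.Linear(2, 1)
model_b = torch.nn.Linear(2, 1)
model_b.load_state_dict(model_a.state_dict())

# (a) Built-in weight decay.
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=wd)
opt_a.zero_grad()
model_a(x).sum().backward()
opt_a.step()

# (b) No weight decay, but an explicit L2 penalty added to the loss.
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
penalty = sum((p ** 2).sum() for p in model_b.parameters())
loss_b = model_b(x).sum() + 0.5 * wd * penalty
loss_b.backward()
opt_b.step()

# Both routes should produce the same updated parameters.
same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

The match relies on the optimizer being plain SGD. With adaptive optimizers the two formulations generally do not give identical updates.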
Solution
Step 1: Recall PyTorch optimizer syntax
PyTorch optimizers accept a parameter named weight_decay to apply L2 regularization.
Step 2: Identify correct parameter name
Only optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001) uses the exact parameter weight_decay correctly.
Final Answer:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001) -> Option D
Quick Check:
Correct parameter name is weight_decay [OK]
- Using wrong parameter names like decay_weight or wd
- Capitalizing parameter names incorrectly
- Confusing weight decay with learning rate
import torch
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.01)
initial_weight = model.weight.data.clone()
optimizer.zero_grad()
output = model(torch.tensor([[1.0, 2.0]]))
loss = output.sum()
loss.backward()
optimizer.step()
updated_weight = model.weight.data
print((initial_weight - updated_weight).abs().sum().item())
What does the printed value represent?
Solution
Step 1: Understand code flow
The code runs one optimizer step with weight decay, then measures how much weights changed.
Step 2: Interpret printed value
The printed value is the sum of absolute differences between initial and updated weights, showing total weight change including weight decay effect.
Final Answer:
The total change in weights after one optimization step including weight decay -> Option A
Quick Check:
Weight change sum = printed value [OK]
- Thinking printed value is loss or learning rate
- Ignoring weight decay effect on weights
- Confusing output sum with weight change
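To see where the printed number comes from, the update can be reproduced by hand. This sketch assumes plain SGD without momentum, for which the step is w_new = w - lr * (grad + wd * w). The names `w0`, `expected`, and `change` are illustrative.

```python
import torch

torch.manual_seed(0)
lr, wd = 0.1, 0.01
x = torch.tensor([[1.0, 2.0]])

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd)
w0 = model.weight.detach().clone()

optimizer.zero_grad()
model(x).sum().backward()
grad = model.weight.grad.clone()    # for this loss, the weight gradient equals x
optimizer.step()

# Manual update: gradient step plus the weight decay term wd * w.
expected = w0 - lr * (grad + wd * w0)
matches = torch.allclose(model.weight.detach(), expected)

# Same quantity the quiz code prints: total absolute weight change.
change = (w0 - model.weight.detach()).abs().sum().item()
print(matches, change)
```

Because the gradient is [1, 2] and the learning rate is 0.1, the change is close to 0.3. The small remainder is the weight decay contribution, which depends on the random initial weights.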
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.1)
But your model is overfitting badly. What is a likely mistake?
Solution
Step 1: Recall weight decay behavior in PyTorch
By default, weight_decay applies to every parameter passed to the optimizer, including biases and batch norm parameters, unless they are placed in a separate parameter group.
Step 2: Understand overfitting cause
Penalizing biases and normalization parameters adds little useful regularization and can distort the fit, so a single global weight_decay value may not control overfitting as intended. These parameters are commonly excluded from weight decay.
Final Answer:
Weight decay is applied to biases by default, so overfitting remains -> Option C
Quick Check:
Biases often excluded from weight decay for better regularization [OK]
- Assuming weight decay does not apply to biases
- Setting weight decay to zero to fix overfitting
- Blaming learning rate for weight decay issues
Solution
Step 1: Understand selective weight decay
To apply weight decay only to weights, separate parameters into groups with and without weight decay.
Step 2: Check code correctness
params = [
    {'params': [p for n, p in model.named_parameters() if 'weight' in n], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if 'bias' in n], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params, lr=0.001)
This creates two groups: weights with weight_decay=0.01 and biases with weight_decay=0.0, correctly excluding biases.
Final Answer:
Code snippet that separates weights and biases with different weight_decay values -> Option B
Quick Check:
Separate params for weight decay control [OK]
- Applying weight decay to all parameters blindly
- Not separating biases from weights
- Using wrong parameter names in filtering
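Filtering on the substring 'weight' also selects normalization layers, whose scale parameter is named weight too. A common alternative, sketched here as a heuristic rather than a rule, is to group by tensor dimension: 1-D tensors (biases, batch norm scale and shift) go into the group without decay. The toy model and the hyperparameter values are assumptions for illustration.

```python
import torch
from torch import nn

# Toy model, used only for illustration.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

decay, no_decay = [], []
for p in model.parameters():
    if not p.requires_grad:
        continue
    # Heuristic: 1-D tensors are biases or normalization scale/shift parameters.
    if p.ndim <= 1:
        no_decay.append(p)
    else:
        decay.append(p)

optimizer = torch.optim.Adam(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.001,
)

print(len(decay), len(no_decay))
```

For this toy model the decay group holds the two Linear weight matrices, and the other group holds the two Linear biases plus the BatchNorm1d weight and bias.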
