Weight decay is often used during training of neural networks. What does it mainly help with?
Think about what happens when weights become very large and how that affects generalization.
Weight decay adds a penalty to large weights, encouraging the model to keep weights small. This reduces overfitting and helps the model generalize better.
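In plain SGD, the L2 penalty is folded directly into the gradient, so each step applies w ← w − lr·(grad + wd·w). A minimal sketch isolating that decay term (using a zero data loss so only the penalty acts):

```python
import torch

# Plain SGD folds weight decay into the update: w <- w - lr * (grad + wd * w)
w = torch.tensor([2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.5)

loss = (w ** 2).sum() * 0.0  # zero data loss: only the decay term acts
loss.backward()
opt.step()

# w <- 2.0 - 0.1 * (0 + 0.5 * 2.0) = 1.9
print(w.item())
```

With no data gradient, each step shrinks the weight by a factor of (1 − lr·wd), which is exactly the "keep weights small" pressure described above.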
Consider this PyTorch code snippet training a simple linear model with weight decay. What will be printed?
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=0.1)
x = torch.tensor([[1.0]])
y = torch.tensor([[2.0]])
criterion = nn.MSELoss()

for _ in range(1):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

print(round(loss.item(), 3))
```
Think about the initial random weights and one step of gradient descent with weight decay.
The printed loss is computed in the forward pass, before optimizer.step() runs, so it reflects the initial random parameters: the squared difference between the initial output and the target 2.0. Because nn.Linear is randomly initialized, the printed value varies from run to run. Weight decay has no effect on it at all, since decay only alters the parameter update, not the already-computed loss.
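To make the printed value deterministic, fix the initialization. A sketch assuming (hypothetically) weight 0.5 and bias 0.0, which makes the pre-step loss exactly (0.5 − 2.0)² = 2.25:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Same setup, but with a fixed initialization so the output is deterministic.
model = nn.Linear(1, 1)
with torch.no_grad():
    model.weight.fill_(0.5)
    model.bias.fill_(0.0)

optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=0.1)
x = torch.tensor([[1.0]])
y = torch.tensor([[2.0]])
criterion = nn.MSELoss()

optimizer.zero_grad()
loss = criterion(model(x), y)  # output = 0.5, so loss = (0.5 - 2.0)^2 = 2.25
loss.backward()
optimizer.step()                # updates parameters, but loss is already computed
print(round(loss.item(), 3))   # 2.25
```

The step changes the parameters, but the printed number is the loss from before the update.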
You want to apply weight decay only to the weights of a neural network, not to the bias terms. Which optimizer setup below does this correctly?
Think about how to separate parameters by name and assign different weight decay values.
Option A is correct: it splits the parameters into two optimizer groups, assigning weight_decay=0.0 to the bias parameters and a nonzero weight_decay to the remaining (weight) parameters.
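A sketch of that parameter-group setup (the model architecture here is an arbitrary example; the pattern is to partition named_parameters() by name and pass per-group weight_decay values):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Partition parameters by name: biases get no decay, everything else gets 0.01.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = optim.SGD(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,
)
print([g["weight_decay"] for g in optimizer.param_groups])  # [0.01, 0.0]
```

Per-group options override the optimizer-level defaults, so each group keeps its own decay setting.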
If you increase the weight decay value excessively, what is the most likely effect on the model's training?
Consider what happens if the penalty on weights is very strong.
Too much weight decay forces the weights toward zero, limiting the model's capacity to learn patterns in the data and causing underfitting: both training and validation error stay high.
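The shrinking effect can be seen directly: with SGD, each step multiplies the weight by (1 − lr·wd) on top of the data update. A sketch with deliberately extreme values (lr=0.1, wd=5.0, so the factor is 0.5) and a zero data gradient to isolate the decay:

```python
import torch

# With lr=0.1 and weight_decay=5.0, the per-step shrink factor is
# 1 - 0.1 * 5.0 = 0.5: the weight halves every step, whatever the data says.
w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1, weight_decay=5.0)

for _ in range(3):
    opt.zero_grad()
    (0.0 * w).sum().backward()  # zero data gradient, to isolate the decay
    opt.step()

print(w.item())  # 1.0 * 0.5**3 = 0.125
```

When the decay term overwhelms the loss gradient like this, the weights collapse toward zero faster than the data can push them anywhere useful.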
Look at this PyTorch code snippet. The user expects weight decay to be applied, but the model's weights do not shrink. What is the bug?
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=0.01)
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])
criterion = nn.MSELoss()

for _ in range(10):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

print(model.weight)
```
Compare the size of the decay term (lr × weight_decay per step) with the size of the loss gradient, and consider which parameters the optimizer's weight_decay actually touches.
There is no configuration bug: PyTorch's built-in weight_decay applies to every parameter passed to the optimizer, biases included. The weights do not visibly shrink because the decay term is tiny (lr × weight_decay = 0.001 per step) while the data-fitting gradient dominates over only 10 steps. The weights settle at a trade-off between fitting y and staying small, not at zero; to see the decay in isolation, remove the data gradient or train for far longer.
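A sketch verifying that the optimizer's weight_decay touches the bias as well as the weight (fixed initialization and a zero data gradient, so only the decay term moves the parameters):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Passing model.parameters() with weight_decay decays *every* parameter in
# the group, bias included. With a zero data gradient, both weight and bias
# shrink by the same factor (1 - lr * wd) = 0.999 per step.
model = nn.Linear(2, 1)
with torch.no_grad():
    model.weight.fill_(1.0)
    model.bias.fill_(1.0)

opt = optim.SGD(model.parameters(), lr=0.1, weight_decay=0.01)
opt.zero_grad()
(0.0 * model(torch.zeros(1, 2))).sum().backward()  # zero data gradient
opt.step()

print(model.weight.detach().flatten().tolist(), model.bias.item())
```

To exempt biases from decay, use the two-parameter-group pattern from the earlier question instead of passing model.parameters() directly.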