Weight decay (L2 regularization) in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main purpose of weight decay (L2 regularization) in training neural networks?

Weight decay is often used during training of neural networks. What does it mainly help with?

A. It speeds up the training by increasing the learning rate automatically.
B. It helps the model memorize the training data perfectly.
C. It prevents the model from overfitting by penalizing large weights.
D. It changes the activation functions to nonlinear ones.
💡 Hint

Think about what happens when weights become very large and how that affects generalization.

Predict Output
intermediate
What is the output of this PyTorch training loop snippet with weight decay?

Consider this PyTorch code snippet training a simple linear model with weight decay. What will be printed?

PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=0.1)

x = torch.tensor([[1.0]])
y = torch.tensor([[2.0]])

criterion = nn.MSELoss()

for _ in range(1):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

print(round(loss.item(), 3))
A. 0.0
B. 0.25
C. 0.81
D. 1.0
💡 Hint

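The loss is computed before optimizer.step() runs, so the printed value depends only on the model's initial weight and bias. The weight decay term changes the parameters after this loss has already been recorded.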

Model Choice
advanced
Which optimizer setup correctly applies weight decay only to weights but not biases in PyTorch?

You want to apply weight decay only to the weights of a neural network, not to the bias terms. Which optimizer setup below does this correctly?

A.
optimizer = optim.Adam([
    {'params': [p for n, p in model.named_parameters() if 'bias' not in n], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if 'bias' in n], 'weight_decay': 0.0}
], lr=0.001)
B. optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
C. optimizer = optim.Adam(model.parameters(), lr=0.001)
D.
optimizer = optim.Adam([
    {'params': model.parameters(), 'weight_decay': 0.0}
], lr=0.001)
💡 Hint

Think about how to separate parameters by name and assign different weight decay values.

Hyperparameter
advanced
What is the effect of increasing the weight decay hyperparameter too much during training?

If you increase the weight decay value excessively, what is the most likely effect on the model's training?

A. The model will ignore the training data and memorize the validation set.
B. The model weights will become very large and unstable.
C. The model will train faster and achieve higher accuracy.
D. The model weights will shrink too much, causing underfitting and poor performance.
💡 Hint

Consider what happens if the penalty on weights is very strong.

🔧 Debug
expert
Why does this PyTorch training code not apply weight decay as expected?

Look at this PyTorch code snippet. The user expects weight decay to be applied, but the model's weights do not shrink. What is the bug?

PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=0.01)

x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])

criterion = nn.MSELoss()

for _ in range(10):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

print(model.weight)
A. Weight decay is not applied because the model's bias parameters are updated but weights are not.
B. Weight decay is not applied because the model's weights are frozen by default.
C. Weight decay is not applied because the optimizer is created before model parameters are initialized.
D. The weight_decay parameter is ignored because the learning rate is too high.
💡 Hint

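optim.SGD adds weight_decay * p to the gradient of every parameter it is given, weights and biases alike. Compare the size of that term (weight_decay=0.01) with the gradient of the loss before deciding whether the weights should visibly shrink.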

Practice

1. What is the main purpose of weight decay (L2 regularization) in training a PyTorch model?
easy
A. To reduce overfitting by penalizing large weights
B. To increase the learning rate automatically
C. To add more layers to the model
D. To speed up the training process

Solution

  1. Step 1: Understand weight decay concept

    Weight decay adds a penalty on large weights during training, which discourages the model from fitting noise in the data.
  2. Step 2: Connect to overfitting reduction

    By keeping weights small, the model generalizes better and avoids overfitting.
  3. Final Answer:

    To reduce overfitting by penalizing large weights -> Option A
  4. Quick Check:

    Weight decay = reduces overfitting [OK]
Hint: Weight decay shrinks weights to avoid overfitting [OK]
Common Mistakes:
  • Confusing weight decay with learning rate changes
  • Thinking weight decay adds layers
  • Assuming weight decay speeds training
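To make the "penalizing large weights" idea concrete, here is a minimal sketch that isolates the decay term. It assumes plain torch.optim.SGD without momentum, and the parameter values, lr, and wd are arbitrary illustration choices. The loss is built so that its gradient is zero, which means any change in the parameter comes from weight decay alone.

```python
import torch

# A single parameter tensor; the values are arbitrary illustration data.
p = torch.nn.Parameter(torch.tensor([1.0, -2.0, 4.0]))
lr, wd = 0.1, 0.5
optimizer = torch.optim.SGD([p], lr=lr, weight_decay=wd)

# A loss whose gradient with respect to p is exactly zero,
# so any change in p after the step comes from weight decay alone.
loss = (p * 0.0).sum()
optimizer.zero_grad()
loss.backward()

before = p.detach().clone()
optimizer.step()
after = p.detach().clone()

# Plain SGD adds wd * p to the gradient, so p becomes p * (1 - lr * wd).
shrink_factor = 1 - lr * wd
print(before, after, shrink_factor)
```

Every entry moves toward zero by the same factor, so larger weights lose more in absolute terms than smaller ones.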
2. Which of the following is the correct way to apply weight decay in a PyTorch optimizer?
easy
A. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, wd=0.001)
B. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, decay_weight=0.001)
C. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weightDecay=0.001)
D. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001)

Solution

  1. Step 1: Recall PyTorch optimizer syntax

    PyTorch optimizers accept a parameter named weight_decay to apply L2 regularization.
  2. Step 2: Identify correct parameter name

    Only option D uses the exact keyword weight_decay. The names wd, decay_weight, and weightDecay are not arguments of torch.optim.SGD, so passing them raises a TypeError.
  3. Final Answer:

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001) -> Option D
  4. Quick Check:

    Correct parameter name is weight_decay [OK]
Hint: Use exact parameter name 'weight_decay' in optimizer [OK]
Common Mistakes:
  • Using wrong parameter names like decay_weight or wd
  • Capitalizing parameter names incorrectly
  • Confusing weight decay with learning rate
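The keyword check above can be tried directly. This is a small sketch, assuming a throwaway nn.Linear(2, 1) model that is not part of the question; it reads the stored value back from the optimizer and shows that a misspelled keyword is rejected by the constructor.

```python
import torch

# Hypothetical tiny model, used only to have parameters to pass in.
model = torch.nn.Linear(2, 1)

# The correct keyword is stored in each parameter group of the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001)
stored = optimizer.param_groups[0]["weight_decay"]
print(stored)

# A misspelled keyword is not accepted by the SGD constructor.
try:
    torch.optim.SGD(model.parameters(), lr=0.01, wd=0.001)
    rejected = False
except TypeError as err:
    rejected = True
    print("TypeError:", err)
```

Reading optimizer.param_groups is also a quick way to confirm which decay value a training script is really using.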
3. Consider this PyTorch code snippet:
import torch
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.01)
initial_weight = model.weight.data.clone()
optimizer.zero_grad()
output = model(torch.tensor([[1.0, 2.0]]))
loss = output.sum()
loss.backward()
optimizer.step()
updated_weight = model.weight.data
print((initial_weight - updated_weight).abs().sum().item())

What does the printed value represent?
medium
A. The total change in weights after one optimization step including weight decay
B. The learning rate value
C. The loss value before backward pass
D. The sum of model outputs

Solution

  1. Step 1: Understand code flow

    The code runs one optimizer step with weight decay, then measures how much weights changed.
  2. Step 2: Interpret printed value

    The printed value is the sum of absolute differences between initial and updated weights, showing total weight change including weight decay effect.
  3. Final Answer:

    The total change in weights after one optimization step including weight decay -> Option A
  4. Quick Check:

    Weight change sum = printed value [OK]
Hint: Weight decay affects weight updates, so weight change includes it [OK]
Common Mistakes:
  • Thinking printed value is loss or learning rate
  • Ignoring weight decay effect on weights
  • Confusing output sum with weight change
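The printed weight change can be split into its two contributions. The sketch below assumes plain SGD without momentum and adds a fixed seed (an assumption, not part of the question) so the run is repeatable. It rebuilds the expected weights from the loss gradient and the decay term.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable
model = torch.nn.Linear(2, 1)
lr, wd = 0.1, 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd)

x = torch.tensor([[1.0, 2.0]])
initial_weight = model.weight.detach().clone()

optimizer.zero_grad()
loss = model(x).sum()
loss.backward()
grad = model.weight.grad.detach().clone()  # equals x for this loss
optimizer.step()

# Split the update into its two contributions.
gradient_part = lr * grad                 # from the loss
decay_part = lr * wd * initial_weight     # from weight decay
expected_weight = initial_weight - gradient_part - decay_part

actual_change = (initial_weight - model.weight.detach()).abs().sum().item()
print(actual_change)
```

With weight_decay=0.01 the decay part is small next to the gradient part, which is why the printed value is dominated by lr times the input values.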
4. You wrote this PyTorch optimizer code:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.1)

But your model is overfitting badly. What is a likely mistake?
medium
A. Weight decay value is too high, causing poor training
B. Weight decay should be set to zero to reduce overfitting
C. Weight decay is applied to biases by default, so overfitting remains
D. Learning rate is too low to affect weight decay

Solution

  1. Step 1: Recall weight decay behavior in PyTorch

    By default, weight decay is applied to all parameters, including biases and batch norm weights, unless explicitly excluded.
  2. Step 2: Understand overfitting cause

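    Decaying biases and normalization parameters along with the weights is a common setup mistake: these parameters contribute little to overfitting, so they are usually placed in a separate group with weight_decay=0.0. Options A, B, and D do not explain overfitting: a decay value that is too high leads to underfitting, and removing decay would regularize less. Also note that torch.optim.Adam adds the weight_decay term to the gradient before its adaptive scaling, so the regularization can be weaker than the value 0.1 suggests; torch.optim.AdamW applies decoupled decay instead.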
  3. Final Answer:

    Weight decay is applied to biases by default, so overfitting remains -> Option C
  4. Quick Check:

    Biases are often excluded from weight decay for better regularization [OK]
Hint: Check if weight decay excludes biases to reduce overfitting [OK]
Common Mistakes:
  • Assuming weight decay does not apply to biases
  • Setting weight decay to zero to fix overfitting
  • Blaming learning rate for weight decay issues
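One more point worth checking with Adam is how the weight_decay value is applied. In torch.optim.Adam it is added to the gradient before the adaptive scaling, while torch.optim.AdamW shrinks the parameter directly. The sketch below is an illustration under simplifying assumptions: a single parameter, one step, a loss with zero gradient, and a helper named one_step that is hypothetical.

```python
import torch

def one_step(optimizer_cls):
    # One parameter with value 1.0 and a loss whose gradient is zero,
    # so only the weight decay term moves the parameter.
    p = torch.nn.Parameter(torch.tensor([1.0]))
    optimizer = optimizer_cls([p], lr=0.1, weight_decay=0.1)
    optimizer.zero_grad()
    (p * 0.0).sum().backward()
    optimizer.step()
    return p.item()

adam_value = one_step(torch.optim.Adam)    # decay folded into the gradient
adamw_value = one_step(torch.optim.AdamW)  # decoupled decay
print(adam_value, adamw_value)
```

On the first step Adam moves the parameter by roughly lr regardless of the decay coefficient (to about 0.9), while AdamW multiplies it by 1 - lr * weight_decay (to about 0.99). The same weight_decay number therefore means different things in the two optimizers.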
5. You want to apply weight decay only to the weights of a PyTorch model's linear layers but not to biases. Which code snippet correctly sets this up?
hard
A. optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
B.
params = [
    {'params': [p for n, p in model.named_parameters() if 'weight' in n], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if 'bias' in n], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params, lr=0.001)
C. optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0)
D.
params = [
    {'params': model.parameters(), 'weight_decay': 0.01}
]
optimizer = torch.optim.Adam(params, lr=0.001)

Solution

  1. Step 1: Understand selective weight decay

    To apply weight decay only to weights, separate parameters into groups with and without weight decay.
  2. Step 2: Check code correctness

    Option B creates two groups: parameters whose name contains 'weight' get weight_decay=0.01, and parameters whose name contains 'bias' get weight_decay=0.0, so biases are excluded. Options A and D decay every parameter, and option C disables decay entirely.
  3. Final Answer:

    Code snippet that separates weights and biases with different weight_decay values -> Option B
  4. Quick Check:

    Separate params for weight decay control [OK]
Hint: Group parameters by name to apply weight decay selectively [OK]
Common Mistakes:
  • Applying weight decay to all parameters blindly
  • Not separating biases from weights
  • Using wrong parameter names in filtering
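The grouping can be checked by inspecting the optimizer after it is built. This is a minimal sketch, assuming a small hypothetical nn.Sequential model made only of linear layers; a model with normalization layers would need a filter that also handles their parameters.

```python
import torch
import torch.nn as nn

# Hypothetical small model used only for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Names look like "0.weight" or "0.bias" for an nn.Sequential.
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.Adam(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.001,
)

# Each group reports how many tensors it holds and which decay it uses.
for group in optimizer.param_groups:
    print(len(group["params"]), group["weight_decay"])
```

Building the two lists in one pass guarantees that every parameter lands in exactly one group, which two independent name filters do not.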