Challenge - 5 Problems
Gradient Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of gradient accumulation with zeroing
Consider the following PyTorch code snippet, which performs gradient accumulation over two mini-batches before updating the model parameters. What will be the value of model.linear.weight.grad after the second backward call and before optimizer.step()?

PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1, bias=False)

    def forward(self, x):  # required so model(inputs) works
        return self.linear(x)

model = SimpleModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Input and target batches
inputs1 = torch.tensor([[1.0]])
targets1 = torch.tensor([[2.0]])
inputs2 = torch.tensor([[2.0]])
targets2 = torch.tensor([[4.0]])

criterion = nn.MSELoss()

# First mini-batch
outputs1 = model(inputs1)
loss1 = criterion(outputs1, targets1)
loss1.backward()  # Gradients computed and accumulated

# Second mini-batch
outputs2 = model(inputs2)
loss2 = criterion(outputs2, targets2)
loss2.backward()  # Gradients accumulated again

# What is model.linear.weight.grad here?
```
Attempts: 2 left
💡 Hint
Remember that calling backward accumulates gradients unless you zero them explicitly.
✗ Incorrect
In PyTorch, gradients accumulate by default: each backward call adds to the existing .grad buffers. With initial weight w (randomly initialized, since no seed is set), the first batch contributes dL1/dw = 2(w - 2) and the second contributes dL2/dw = 4(2w - 4) = 8w - 16. Because no optimizer.step() runs between the two backward calls, w is unchanged, so the accumulated gradient is (2w - 4) + (8w - 16) = 10w - 20.
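A minimal sketch that verifies the accumulation numerically. It pins the initial weight to 1.0, an assumed value, since the original snippet leaves it randomly initialized:

```python
import torch
import torch.nn as nn

# Pin the weight so the accumulated gradient is predictable (assumed w = 1.0).
model = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    model.weight.fill_(1.0)

criterion = nn.MSELoss()

# First mini-batch: loss = (w*1 - 2)^2, so dL/dw = 2(w - 2) = -2
loss1 = criterion(model(torch.tensor([[1.0]])), torch.tensor([[2.0]]))
loss1.backward()

# Second mini-batch: loss = (2w - 4)^2, so dL/dw = 4(2w - 4) = -8
loss2 = criterion(model(torch.tensor([[2.0]])), torch.tensor([[4.0]]))
loss2.backward()

# backward() adds into .grad, so the buffer now holds -2 + (-8) = -10
print(model.weight.grad)  # tensor([[-10.]])
```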
🧠 Conceptual
Intermediate · 1:30 remaining
Why zero gradients before backward?
Why do we usually call optimizer.zero_grad() before calling loss.backward() in a training loop?

Attempts: 2 left
💡 Hint
Think about what happens if you don't clear gradients before backward.
✗ Incorrect
Calling optimizer.zero_grad() clears the gradients from the previous backward pass. Without this, gradients would accumulate, causing incorrect updates. This is important to ensure each update step uses only the current batch's gradients.
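A small sketch of the bookkeeping. The learning rate is set to 0 purely to keep the weight fixed, and the starting weight of 1.0 is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    model.weight.fill_(1.0)          # assumed starting weight
optimizer = optim.SGD(model.parameters(), lr=0.0)  # lr=0: weight never moves
criterion = nn.MSELoss()

x, y = torch.tensor([[1.0]]), torch.tensor([[2.0]])
for _ in range(3):
    optimizer.zero_grad()            # clear the previous iteration's gradients
    criterion(model(x), y).backward()
    optimizer.step()

# With zero_grad(), .grad holds only the current batch's gradient, 2(w - 2) = -2.
# Without it, the three backward calls would have accumulated -6 instead.
print(model.weight.grad)
```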
❓ Hyperparameter
Advanced · 1:30 remaining
Choosing accumulation steps for gradient accumulation
You want to simulate a batch size of 64, but your GPU can only handle batch size 16. You decide to use gradient accumulation with accumulation_steps. Which value of accumulation_steps correctly simulates the larger batch size?

Attempts: 2 left
💡 Hint
Think about how many small batches you need to accumulate to reach the target batch size.
✗ Incorrect
Gradient accumulation steps multiply the effective batch size. If your GPU can handle 16 samples at once, accumulating gradients over 4 steps simulates a batch size of 64 (16 * 4).
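The arithmetic behind the answer can be sketched as a tiny helper (the function name is hypothetical, not from any library):

```python
# Hypothetical helper: how many micro-batches to accumulate for a target batch size.
def accumulation_steps(target_batch_size, micro_batch_size):
    assert target_batch_size % micro_batch_size == 0, \
        "target batch size should be a multiple of the micro-batch size"
    return target_batch_size // micro_batch_size

print(accumulation_steps(64, 16))  # 4: accumulate four batches of 16 to simulate 64
```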
🔧 Debug
Advanced · 2:00 remaining
Why does the model not train when zero_grad is missing?
You wrote this training loop but the model's loss does not decrease over epochs. What is the most likely cause?
PyTorch
```python
for epoch in range(5):
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # Missing optimizer.zero_grad()
```
Attempts: 2 left
💡 Hint
What happens if gradients are never cleared?
✗ Incorrect
Without zeroing, gradients from every previous batch keep accumulating in the .grad buffers, so each optimizer.step() applies an ever-growing, stale update. These oversized updates prevent the model from converging, leaving the loss high or fluctuating.
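A corrected version of the loop, sketched with a tiny synthetic dataset standing in for dataloader (the data points on y = 2x and the hyperparameters are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# Two points on y = 2x stand in for `dataloader`
data = [(torch.tensor([[-1.0]]), torch.tensor([[-2.0]])),
        (torch.tensor([[1.0]]), torch.tensor([[2.0]]))]

losses = []
for epoch in range(50):
    for inputs, targets in data:
        optimizer.zero_grad()              # the missing line: clear stale gradients
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

# With zero_grad() in place, the loss now decreases toward zero.
print(losses[0], losses[-1])
```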
❓ Metrics
Expert · 2:00 remaining
Effect of gradient accumulation on training metrics
You train a model with gradient accumulation over 4 steps and batch size 16, simulating batch size 64. Which of the following statements about training loss and accuracy metrics logged per optimizer step is true?
Attempts: 2 left
💡 Hint
Think about what batch size the optimizer step represents when using accumulation.
✗ Incorrect
When accumulating gradients over 4 steps of batch size 16, the optimizer step updates parameters as if using batch size 64. Metrics logged per optimizer step reflect this larger effective batch size, resulting in smoother and more stable values.
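One logging subtlety worth illustrating: when each micro-batch loss is divided by accumulation_steps (the usual scaling, so the accumulated gradient matches a single large batch), summing the scaled losses over one optimizer step reproduces the mean loss over the full effective batch. A sketch with assumed shapes and random data:

```python
import torch
import torch.nn as nn

accumulation_steps = 4
micro_batch = 16                         # effective batch = 4 * 16 = 64
torch.manual_seed(0)

model = nn.Linear(8, 1)
criterion = nn.MSELoss()
micro_batches = [(torch.randn(micro_batch, 8), torch.randn(micro_batch, 1))
                 for _ in range(accumulation_steps)]

step_loss = 0.0
for x, y in micro_batches:
    loss = criterion(model(x), y) / accumulation_steps  # scale so gradients average
    loss.backward()                                     # gradients accumulate in .grad
    step_loss += loss.item()                            # log once per optimizer step

# Sanity check: step_loss equals the MSE over all 64 samples at once
# (valid because all micro-batches have equal size and no step has run yet).
full_x = torch.cat([x for x, _ in micro_batches])
full_y = torch.cat([y for _, y in micro_batches])
full_loss = criterion(model(full_x), full_y).item()
print(step_loss, full_loss)
```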