PyTorch · ~20 mins

Gradient accumulation and zeroing in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output · intermediate
Output of gradient accumulation with zeroing
Consider the following PyTorch code snippet that performs gradient accumulation over two mini-batches before updating the model parameters. What will be the value of model.linear.weight.grad after the second backward call and before optimizer.step()?
PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1, bias=False)

model = SimpleModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Input and target batches
inputs1 = torch.tensor([[1.0]])
targets1 = torch.tensor([[2.0]])
inputs2 = torch.tensor([[2.0]])
targets2 = torch.tensor([[4.0]])

criterion = nn.MSELoss()

# First mini-batch
outputs1 = model(inputs1)
loss1 = criterion(outputs1, targets1)
loss1.backward()  # Gradients computed and accumulated

# Second mini-batch
outputs2 = model(inputs2)
loss2 = criterion(outputs2, targets2)
loss2.backward()  # Gradients accumulated again

# What is model.linear.weight.grad here?
A. A tensor approximately equal to [[-15.0]], representing triple the gradient from one batch
B. A tensor approximately equal to [[-10.0]], representing the sum of gradients from both batches
C. None, because gradients are zeroed automatically after each backward call
D. A tensor approximately equal to [[-5.0]], representing only the gradient from the second batch
💡 Hint
Remember that calling backward accumulates gradients unless you zero them explicitly.
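To see the accumulation rule in isolation, here is a toy check that is separate from the quiz snippet: the weight is pinned to 1.0 and the data are different numbers (both assumptions made for easy hand-verification, not the quiz's setup or answer):

```python
import torch
import torch.nn as nn

# Toy check of gradient accumulation, separate from the quiz snippet.
# The weight is pinned to 1.0 (an assumption) so each gradient is easy to verify by hand.
lin = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    lin.weight.fill_(1.0)
criterion = nn.MSELoss()

# First backward: d/dw (w*1 - 0)^2 at w=1 is 2.
criterion(lin(torch.tensor([[1.0]])), torch.tensor([[0.0]])).backward()
g1 = lin.weight.grad.item()

# Second backward without zeroing: its own gradient, d/dw (w*2 - 0)^2 at w=1, is 8.
criterion(lin(torch.tensor([[2.0]])), torch.tensor([[0.0]])).backward()
g2 = lin.weight.grad.item()

print(g1, g2)  # g2 equals g1 plus the second batch's own gradient
```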
🧠 Conceptual · intermediate
Why zero gradients before backward?
Why do we usually call optimizer.zero_grad() before calling loss.backward() in a training loop?
A. To clear old gradients so that new gradients are not accumulated on top of them
B. To reset model weights to zero before each update
C. To initialize the optimizer's learning rate
D. To save memory by deleting the computation graph
💡 Hint
Think about what happens if you don't clear gradients before backward.
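As a sketch of the usual ordering, here is a minimal loop; the model and data are toy stand-ins, not taken from any quiz question:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Minimal sketch of the standard ordering; model and data are toy stand-ins.
torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
x, y = torch.tensor([[1.0]]), torch.tensor([[2.0]])

losses = []
for step in range(10):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = criterion(model(x), y)
    loss.backward()              # gradients now reflect this batch only
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])     # loss shrinks because each step uses fresh gradients
```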
Hyperparameter · advanced
Choosing accumulation steps for gradient accumulation
You want to simulate a batch size of 64 but your GPU can only handle batch size 16. You decide to use gradient accumulation with accumulation_steps. Which value of accumulation_steps correctly simulates the larger batch size?
A. 8, because 16 / 8 = 2
B. 2, because 16 + 2 = 18
C. 4, because 16 * 4 = 64
D. 1, because no accumulation is needed
💡 Hint
Think about how many small batches you need to accumulate to reach the target batch size.
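The arithmetic can be sketched as a short accumulation loop; the model, random data, and learning rate here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Sketch: the number of micro-batches to accumulate is target size / micro size.
# The model, data, and learning rate are illustrative assumptions.
target_batch, micro_batch = 64, 16
accumulation_steps = target_batch // micro_batch   # 64 // 16 = 4

model = nn.Linear(4, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

optimizer.zero_grad()
for _ in range(accumulation_steps):
    x = torch.randn(micro_batch, 4)
    y = torch.randn(micro_batch, 1)
    # Divide so the accumulated gradient matches the average over all 64 samples.
    (criterion(model(x), y) / accumulation_steps).backward()
optimizer.step()   # one update, as if the batch size were 64

print(accumulation_steps)
```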
🔧 Debug · advanced
Why does the model not train when zero_grad is missing?
You wrote this training loop but the model's loss does not decrease over epochs. What is the most likely cause?
PyTorch
for epoch in range(5):
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # Missing optimizer.zero_grad()
A. The optimizer.step() is called before loss.backward(), causing errors
B. The model weights are reset to zero every batch, so no learning happens
C. The loss.backward() is not called, so no gradients are computed
D. Gradients accumulate every batch, causing optimizer.step() to apply huge updates, destabilizing training
💡 Hint
What happens if gradients are never cleared?
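One way to repair a loop like the one above can be sketched as follows; the model, data, and hyperparameters are toy stand-ins, not the quiz's actual setup:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Repaired loop sketch; model, data, and hyperparameters are toy stand-ins.
torch.manual_seed(0)
model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()
xs = torch.randn(64, 2)
ys = xs.sum(dim=1, keepdim=True)                     # a learnable linear target
dataloader = DataLoader(TensorDataset(xs, ys), batch_size=16)

with torch.no_grad():
    loss_before = criterion(model(xs), ys).item()

for epoch in range(5):
    for inputs, targets in dataloader:
        optimizer.zero_grad()                        # the missing line
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    loss_after = criterion(model(xs), ys).item()
print(loss_before, loss_after)  # training now makes progress
```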
Metrics · expert
Effect of gradient accumulation on training metrics
You train a model with gradient accumulation over 4 steps and batch size 16, simulating batch size 64. Which of the following statements about training loss and accuracy metrics logged per optimizer step is true?
A. Loss and accuracy logged per optimizer step correspond to the combined effect of 64 samples, so they are smoother and more stable
B. Loss and accuracy logged per optimizer step correspond to only 16 samples, so they are noisier
C. Loss and accuracy do not change because accumulation does not affect metrics
D. Loss and accuracy are invalid because gradients are accumulated
💡 Hint
Think about what batch size the optimizer step represents when using accumulation.
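A sketch of per-optimizer-step logging under accumulation; the model, random data, and hyperparameters here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Sketch: log once per optimizer step, averaging the micro-batch losses,
# so the logged value reflects the full simulated batch of 64 samples.
# Model, data, and hyperparameters are illustrative assumptions.
accumulation_steps, micro_batch = 4, 16
model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

running_loss = 0.0
optimizer.zero_grad()
for _ in range(accumulation_steps):
    x = torch.randn(micro_batch, 3)
    y = torch.randn(micro_batch, 1)
    loss = criterion(model(x), y)
    (loss / accumulation_steps).backward()
    running_loss += loss.item()
optimizer.step()

step_loss = running_loss / accumulation_steps   # one smoother number per update
print(step_loss)
```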