Recall & Review
beginner
What is gradient accumulation in training neural networks?
Gradient accumulation is a technique where gradients from multiple mini-batches are added together before updating the model weights. This simulates a larger batch size without needing more memory.
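The equivalence described above can be checked without any framework: for a loss that sums over examples, per-example gradients add, so accumulating mini-batch gradients reproduces the full-batch gradient. A toy illustration with a linear loss (plain Python, values made up for the example):

```python
# Toy illustration: for L = sum(w * x_i), the gradient dL/dw is sum(x_i),
# so gradients computed on mini-batches and added together equal the
# gradient computed on the full batch in one pass.
def grad_wrt_w(xs):
    # analytic gradient of sum(w * x_i) with respect to w
    return sum(xs)

full_batch = [1.0, 2.0, 3.0, 4.0]
g_full = grad_wrt_w(full_batch)            # gradient over the full batch

g_accum = 0.0
for mini_batch in (full_batch[:2], full_batch[2:]):
    g_accum += grad_wrt_w(mini_batch)      # accumulate instead of updating

assert g_accum == g_full                   # same gradient, half the memory per step
```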
beginner
Why use gradient accumulation instead of increasing batch size directly?
Because hardware memory limits may prevent using large batches. Gradient accumulation lets you use small batches but still get the effect of a large batch by summing gradients over steps.
intermediate
In PyTorch, what must you avoid doing before calling loss.backward() when using gradient accumulation?
Do NOT call optimizer.zero_grad() before every mini-batch. Zero the gradients only after the accumulated update, so that successive backward() calls keep adding into each parameter's .grad.
intermediate
How do you update model weights when using gradient accumulation with an accumulation step of 4?
You call optimizer.step() and optimizer.zero_grad() only every 4 mini-batches, after the gradients from each of those batches have accumulated.
intermediate
What is a simple PyTorch code pattern for gradient accumulation?
For each mini-batch: compute the loss and call loss.backward() without zeroing gradients; then, every N batches, call optimizer.step() and optimizer.zero_grad() to update the weights and reset the gradients.
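The pattern above might look like the following sketch. The model, data, and hyperparameters are made up for illustration; dividing the loss by the accumulation count is a common convention that keeps the accumulated gradient at the same scale as one large averaged batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accumulation_steps = 4                      # N: update weights every 4 mini-batches
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(8)]

updates = 0
optimizer.zero_grad()                       # start from clean gradients
for i, (inputs, targets) in enumerate(batches):
    loss = loss_fn(model(inputs), targets)
    # backward() ADDS into each parameter's .grad; scaling by 1/N keeps the
    # accumulated gradient comparable to averaging over one large batch
    (loss / accumulation_steps).backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                    # weight update once per N mini-batches
        optimizer.zero_grad()               # reset for the next accumulation window
        updates += 1
```

With 8 mini-batches and N = 4, the loop performs exactly 2 weight updates, each driven by gradients accumulated over 4 mini-batches.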
What does gradient accumulation help with?
Gradient accumulation sums gradients over multiple batches to simulate a larger batch size without needing more memory.
When using gradient accumulation, when should you call optimizer.zero_grad()?
You zero gradients only after accumulating over the desired number of mini-batches to keep adding gradients.
If your accumulation step is 3, how often do you call optimizer.step()?
You update weights every 3 mini-batches after accumulating gradients from each.
Which of these is NOT a benefit of gradient accumulation?
Gradient accumulation helps with batch size and memory but does not automatically improve accuracy.
What happens if you forget to call optimizer.zero_grad() after accumulation?
Not zeroing gradients causes them to accumulate indefinitely, leading to wrong updates.
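A tiny PyTorch check makes this failure mode concrete (toy tensors, chosen for illustration):

```python
import torch

# Without zero_grad(), successive backward() calls keep adding into .grad.
w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([3.0])

(w * x).sum().backward()
first = w.grad.item()        # true gradient: dL/dw = x = 3.0

(w * x).sum().backward()     # forgot to zero the gradient first
second = w.grad.item()       # now 6.0, double the true gradient
```

Every further backward() call would keep inflating .grad, so optimizer.step() would apply ever-larger, incorrect updates.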
Explain how gradient accumulation works and why it is useful in training neural networks.
Think about how small batches can add up to a big batch effect.
Describe the changes needed in a PyTorch training loop to implement gradient accumulation.
Focus on when to zero gradients and when to update weights.