PyTorch · ML · ~5 mins

Gradient accumulation in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is gradient accumulation in training neural networks?
Gradient accumulation is a technique where gradients from multiple mini-batches are added together before updating the model weights. This simulates a larger batch size without needing more memory.
beginner
Why use gradient accumulation instead of increasing batch size directly?
Because hardware memory limits may prevent using large batches. Gradient accumulation lets you use small batches but still get the effect of a large batch by summing gradients over steps.
intermediate
In PyTorch, what must you do before calling loss.backward() when using gradient accumulation?
Do NOT call optimizer.zero_grad() before every mini-batch. Zero the gradients only after the weight update at the end of each accumulation cycle, so that gradients keep adding up across the intermediate mini-batches.
intermediate
How do you update model weights when using gradient accumulation with an accumulation step of 4?
Call optimizer.step() followed by optimizer.zero_grad() once every 4 mini-batches, after the gradients from all 4 batches have been accumulated.
intermediate
What is a simple PyTorch code pattern for gradient accumulation?
For each mini-batch: compute the loss, call loss.backward() without zeroing the gradients, then every N batches call optimizer.step() followed by optimizer.zero_grad() to update the weights and reset the gradients. Dividing each loss by N keeps the accumulated gradient an average (matching a single large batch) rather than a sum.
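A minimal sketch of this pattern. The model, optimizer, loss function, loader sizes, and accum_steps below are illustrative placeholders, not from the source:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4  # effective batch size = mini-batch size * 4

# Fake data loader: 8 mini-batches of 16 samples each (illustrative)
loader = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(8)]

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    # Dividing by accum_steps turns the summed gradient into an
    # average, matching what one large batch would produce.
    (loss / accum_steps).backward()  # grads add into .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # update weights with accumulated grads
        optimizer.zero_grad()  # reset only after the update
```

Note that zero_grad() is called only once per accumulation cycle, immediately after step(); between updates, backward() keeps summing into each parameter's .grad buffer.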
What does gradient accumulation help with?
A. Increasing learning rate automatically
B. Reducing the number of model parameters
C. Simulating larger batch sizes without extra memory
D. Stopping training early
When using gradient accumulation, when should you call optimizer.zero_grad()?
A. After accumulating gradients over several mini-batches
B. After every mini-batch
C. Before every mini-batch
D. Never
If your accumulation step is 3, how often do you call optimizer.step()?
A. Every mini-batch
B. Every 3 mini-batches
C. Every 2 mini-batches
D. Only once at the end
Which of these is NOT a benefit of gradient accumulation?
A. Improves model accuracy by default
B. Allows training with large effective batch sizes on limited memory
C. Helps stabilize training by simulating bigger batches
D. Reduces memory usage compared to large batch training
What happens if you forget to call optimizer.zero_grad() after accumulation?
A. Training stops immediately
B. Nothing, training continues normally
C. Model weights reset to initial values
D. Gradients keep accumulating, possibly causing incorrect updates
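The last question can be demonstrated directly: PyTorch sums into a parameter's .grad buffer on every backward() call until the buffer is cleared. A tiny sketch (the tensor and values are illustrative):

```python
import torch

# A single trainable scalar; loss = 3w, so d(loss)/dw = 3
w = torch.tensor([2.0], requires_grad=True)

loss = (w * 3).sum()
loss.backward()
first = w.grad.clone()   # 3.0

# Without zero_grad(), the next backward() ADDS to the stored grad
loss = (w * 3).sum()
loss.backward()
second = w.grad.clone()  # 6.0, not 3.0

print(first.item(), second.item())  # 3.0 6.0
```

This accumulation is exactly what gradient accumulation exploits on purpose; forgetting to zero the buffer after an update means stale gradients silently inflate the next update.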
Explain how gradient accumulation works and why it is useful in training neural networks.
Hint: Think about how small batches can add up to a big-batch effect.
Describe the changes needed in a PyTorch training loop to implement gradient accumulation.
Hint: Focus on when to zero gradients and when to update weights.