PyTorch · ~5 mins

Gradient accumulation and zeroing in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is gradient accumulation in PyTorch?
Gradient accumulation is a technique where gradients are summed over multiple mini-batches before updating model weights. This helps simulate a larger batch size without increasing memory usage.
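As a rough sketch of the idea (the model, optimizer, data, and `accumulation_steps` below are illustrative placeholders, not part of the card), a gradient accumulation loop might look like:

```python
import torch

# Minimal sketch of gradient accumulation: sum gradients over several
# small mini-batches, then take one optimizer step.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

accumulation_steps = 4  # effective batch = mini-batch size * 4

optimizer.zero_grad()
for step in range(8):  # stand-in for iterating over a DataLoader
    x = torch.randn(2, 4)            # small mini-batch of inputs
    y = torch.randn(2, 1)            # matching targets
    loss = loss_fn(model(x), y)
    # Scale so the summed gradient matches a large-batch average
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()             # update with accumulated gradients
        optimizer.zero_grad()        # clear before the next cycle
```

Note the single `zero_grad()` per accumulation cycle: between steps, the gradients are deliberately allowed to add up.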
beginner
Why do we need to zero gradients in PyTorch during training?
We zero gradients to clear old gradient values from the previous backward pass. Without zeroing, gradients would keep accumulating unintentionally, leading to incorrect updates.
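A toy example makes this concrete: PyTorch adds each `backward()` result into `.grad`, so stale gradients persist until cleared (the tensor `w` here is just a made-up scalar parameter).

```python
import torch

# Demonstration: .grad accumulates across backward() calls
# unless it is explicitly zeroed.
w = torch.ones(1, requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()        # d/dw of 2w is 2 -> tensor([2.])

(2 * w).sum().backward()      # no zeroing in between
second = w.grad.clone()       # old 2 + new 2 -> tensor([4.])

w.grad.zero_()                # manual equivalent of optimizer.zero_grad()
```

The second backward pass silently doubles the gradient, which is exactly the "unintentional accumulation" the card warns about.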
beginner
How do you zero gradients in PyTorch?
Call optimizer.zero_grad() before the backward pass, typically at the start of each training iteration. It resets the gradients of every parameter the optimizer manages (in recent PyTorch versions it sets them to None by default, which behaves equivalently).
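For reference, a conventional (non-accumulating) training loop places the call like this; the model, data, and loop length are illustrative stand-ins:

```python
import torch

# Typical placement: zero_grad() at the start of each iteration,
# before loss.backward().
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for _ in range(3):
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()              # compute fresh gradients
    optimizer.step()             # apply them
```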
intermediate
What happens if you forget to zero gradients in a training loop?
Gradients from all previous batches accumulate, causing the model to update weights incorrectly and potentially harming training performance.
intermediate
How does gradient accumulation help when GPU memory is limited?
By accumulating gradients over several small batches, you can simulate a larger batch size without needing to load all data at once, saving memory.
What PyTorch function is used to clear gradients before a backward pass?
A. optimizer.step()
B. model.zero_grad()
C. loss.backward()
D. optimizer.zero_grad()
Why accumulate gradients over multiple batches?
A. To increase effective batch size without extra memory
B. To speed up training by skipping backward passes
C. To avoid zeroing gradients
D. To reduce model size
What happens if you call optimizer.step() without zeroing gradients first?
A. Weights update with accumulated gradients from previous steps
B. Weights do not update
C. Training crashes
D. Gradients reset automatically
Which of these is NOT a reason to use gradient accumulation?
A. Limited GPU memory
B. Simulate larger batch size
C. Avoid zeroing gradients
D. Improve training stability
When should optimizer.zero_grad() be called in the training loop?
A. After optimizer.step()
B. Before loss.backward()
C. After loss.backward()
D. At the end of training
Explain how gradient accumulation works and why it is useful in training deep learning models.
Think about training with small batches but wanting the effect of a big batch.
Describe the importance of zeroing gradients in PyTorch and what could happen if you skip this step.
Consider what happens if gradients keep adding up every batch.