Recall & Review
beginner
What is gradient accumulation in training neural networks?
Gradient accumulation is a technique where gradients from multiple mini-batches are added together before updating the model weights. This simulates a larger batch size without needing more memory.
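The equivalence described above can be checked without any framework: for a loss that sums over examples, per-example gradients add, so accumulating mini-batch gradients reproduces the full-batch gradient. A toy illustration with a linear loss (plain Python, values made up for the example):

```python
# Toy illustration: for L = sum(w * x_i), the gradient dL/dw is sum(x_i),
# so gradients computed on mini-batches and added together equal the
# gradient computed on the full batch in one pass.
def grad_wrt_w(xs):
    # analytic gradient of sum(w * x_i) with respect to w
    return sum(xs)

full_batch = [1.0, 2.0, 3.0, 4.0]
g_full = grad_wrt_w(full_batch)            # gradient over the full batch

g_accum = 0.0
for mini_batch in (full_batch[:2], full_batch[2:]):
    g_accum += grad_wrt_w(mini_batch)      # accumulate instead of updating

assert g_accum == g_full                   # same gradient, half the memory per step
```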
beginner
Why use gradient accumulation instead of increasing batch size directly?
Because hardware memory limits may prevent using large batches. Gradient accumulation lets you use small batches but still get the effect of a large batch by summing gradients over steps.
intermediate
In PyTorch, what must you avoid doing before calling loss.backward() when using gradient accumulation?
Do NOT call optimizer.zero_grad() before every mini-batch. Zero the gradients only after the accumulated update, so that successive backward() calls keep adding into each parameter's .grad.
intermediate
How do you update model weights when using gradient accumulation with an accumulation step of 4?
You call optimizer.step() and optimizer.zero_grad() only every 4 mini-batches, after the gradients from each of those batches have accumulated.
intermediate
What is a simple PyTorch code pattern for gradient accumulation?
For each mini-batch: compute the loss and call loss.backward() without zeroing gradients; then, every N batches, call optimizer.step() and optimizer.zero_grad() to update the weights and reset the gradients.
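The pattern above might look like the following sketch. The model, data, and hyperparameters are made up for illustration; dividing the loss by the accumulation count is a common convention that keeps the accumulated gradient at the same scale as one large averaged batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accumulation_steps = 4                      # N: update weights every 4 mini-batches
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(8)]

updates = 0
optimizer.zero_grad()                       # start from clean gradients
for i, (inputs, targets) in enumerate(batches):
    loss = loss_fn(model(inputs), targets)
    # backward() ADDS into each parameter's .grad; scaling by 1/N keeps the
    # accumulated gradient comparable to averaging over one large batch
    (loss / accumulation_steps).backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                    # weight update once per N mini-batches
        optimizer.zero_grad()               # reset for the next accumulation window
        updates += 1
```

With 8 mini-batches and N = 4, the loop performs exactly 2 weight updates, each driven by gradients accumulated over 4 mini-batches.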
What does gradient accumulation help with?
Gradient accumulation sums gradients over multiple batches to simulate a larger batch size without needing more memory.
When using gradient accumulation, when should you call optimizer.zero_grad()?
You zero gradients only after accumulating over the desired number of mini-batches to keep adding gradients.
If your accumulation step is 3, how often do you call optimizer.step()?
You update weights every 3 mini-batches after accumulating gradients from each.
Which of these is NOT a benefit of gradient accumulation?
Gradient accumulation helps with batch size and memory but does not automatically improve accuracy.
What happens if you forget to call optimizer.zero_grad() after accumulation?
Not zeroing gradients causes them to accumulate indefinitely, leading to wrong updates.
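A tiny PyTorch check makes this failure mode concrete (toy tensors, chosen for illustration):

```python
import torch

# Without zero_grad(), successive backward() calls keep adding into .grad.
w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([3.0])

(w * x).sum().backward()
first = w.grad.item()        # true gradient: dL/dw = x = 3.0

(w * x).sum().backward()     # forgot to zero the gradient first
second = w.grad.item()       # now 6.0, double the true gradient
```

Every further backward() call would keep inflating .grad, so optimizer.step() would apply ever-larger, incorrect updates.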
Explain how gradient accumulation works and why it is useful in training neural networks.
Think about how small batches can add up to a big batch effect.
Describe the changes needed in a PyTorch training loop to implement gradient accumulation.
Focus on when to zero gradients and when to update weights.