Consider the following PyTorch code snippet that uses gradient accumulation with accumulation_steps=2. What will be the printed value of model.linear.weight.grad after the loop?
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1, bias=False)

model = SimpleModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = [torch.tensor([[1.0]]), torch.tensor([[2.0]])]
targets = [torch.tensor([[2.0]]), torch.tensor([[4.0]])]
accumulation_steps = 2

optimizer.zero_grad()
for i in range(2):
    output = model.linear(inputs[i])
    loss = (output - targets[i]).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
print(model.linear.weight.grad)
Remember that optimizer.step() followed by optimizer.zero_grad() clears the accumulated gradients, and check what optimizer.zero_grad() leaves in .grad by default.
After the second iteration, optimizer.step() and optimizer.zero_grad() are called, so the accumulated gradient is cleared. With the default zero_grad(set_to_none=True) in PyTorch 2.0 and later, model.linear.weight.grad prints None; with zero_grad(set_to_none=False), it prints a zero tensor, tensor([[0.]]).
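The two behaviors can be checked directly. This is a minimal sketch, assuming PyTorch 2.x, where zero_grad defaults to set_to_none=True:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1, bias=False)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = (model(torch.tensor([[1.0]])) - torch.tensor([[2.0]])).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()                    # default set_to_none=True in PyTorch >= 2.0
print(model.weight.grad)           # None

loss = (model(torch.tensor([[1.0]])) - torch.tensor([[2.0]])).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad(set_to_none=False)   # keep a zero-filled tensor instead
print(model.weight.grad)           # tensor([[0.]])
```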
Why do we use gradient accumulation in training deep learning models?
Think about memory limits and batch sizes.
Gradient accumulation allows training with a larger effective batch size by summing gradients over multiple smaller micro-batches before updating the model weights, which helps when GPU memory cannot hold the full batch.
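The idea above can be sketched in a few lines. This is an illustrative toy example (the data and hyperparameters are placeholders); dividing each loss by accumulation_steps makes the summed gradient match the average gradient of one large batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 2
micro_batches = [(torch.tensor([[1.0]]), torch.tensor([[2.0]])),
                 (torch.tensor([[2.0]]), torch.tensor([[4.0]]))]

optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = (model(x) - y).pow(2).mean()
    # Scale so the accumulated gradient equals the large-batch average.
    (loss / accumulation_steps).backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()   # one weight update per accumulation_steps micro-batches
        optimizer.zero_grad()
```

With this scaling, the single update is numerically identical to one SGD step on the concatenated batch of both samples.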
If you increase the number of gradient accumulation steps from 1 to 4 without changing the learning rate, what is the effective change in the learning rate per sample?
Think about how many updates happen per sample when accumulating gradients.
With 4 accumulation steps, the optimizer updates the weights once every 4 micro-batches instead of every batch, so each sample participates in 4 times fewer updates, effectively reducing the per-sample learning rate by a factor of 4. Note that this assumes each micro-batch loss is divided by the number of accumulation steps (the usual convention, so the summed gradient matches the large-batch average); without that scaling, the gradients simply sum and the per-update step grows by 4 instead.
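The scaling convention mentioned above is easy to verify numerically. This toy sketch (values are arbitrary) accumulates four identical micro-batches with and without dividing the loss by accumulation_steps:

```python
import torch

w = torch.tensor([[0.5]], requires_grad=True)
x, y = torch.tensor([[1.0]]), torch.tensor([[2.0]])
accumulation_steps = 4

# Unscaled losses: gradients simply sum across the 4 backward calls.
for _ in range(accumulation_steps):
    ((w * x - y).pow(2).mean()).backward()
unscaled = w.grad.clone()
w.grad = None

# Losses divided by accumulation_steps: the sum matches one micro-batch's
# average gradient (here all micro-batches are identical).
for _ in range(accumulation_steps):
    ((w * x - y).pow(2).mean() / accumulation_steps).backward()
scaled = w.grad.clone()

# The unscaled accumulated gradient is 4x the scaled one, so an unscaled
# update behaves like a 4x larger learning rate per update.
print(unscaled, scaled)
```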
What error, if any, will the following PyTorch code raise?
import torch
import torch.nn as nn
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 2
optimizer.zero_grad()
for i in range(3):
    x = torch.tensor([[float(i + 1)]])
    y = torch.tensor([[2.0 * (i + 1)]])
    output = model(x)
    loss = (output - y).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
optimizer.step()
Check how many times optimizer.step() is called relative to loss.backward().
None: the code runs without raising an error, because optimizer.step() does not check whether a backward pass preceded it (parameters whose .grad is None are simply skipped). The actual problem is logical. After i = 2, loss.backward() has accumulated a gradient from only one leftover micro-batch, and the trailing optimizer.step() applies that partial, unscaled gradient as a full update.
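One common fix, shown here as a sketch, is to flush the leftover accumulated gradient only when the number of batches is not divisible by accumulation_steps:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 2
num_batches = 3

optimizer.zero_grad()
for i in range(num_batches):
    x = torch.tensor([[float(i + 1)]])
    y = torch.tensor([[2.0 * (i + 1)]])
    loss = (model(x) - y).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Flush the partial accumulation from the trailing odd batch, if any,
# instead of calling optimizer.step() unconditionally.
if num_batches % accumulation_steps != 0:
    optimizer.step()
    optimizer.zero_grad()
```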
You want to train a large transformer model but your GPU memory only fits batch size 8. You want an effective batch size of 32. Which approach is best?
Think about memory limits and how accumulation simulates larger batch sizes.
Using batch size 8 with 4 accumulation steps simulates batch size 32 without exceeding GPU memory. Batch size 32 alone may not fit, and batch size 4 with 8 accumulation steps is less efficient: smaller micro-batches underutilize the GPU and require more forward/backward passes per weight update.
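The recommended setup can be sketched as follows. The model size, data, and learning rate here are toy placeholders; only the micro-batch/accumulation arithmetic matters:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
micro_batch, accumulation_steps = 8, 4   # 8 * 4 = effective batch size 32

data = torch.randn(32, 16)
targets = torch.randn(32, 1)

optimizer.zero_grad()
updates = 0
for step in range(accumulation_steps):
    x = data[step * micro_batch:(step + 1) * micro_batch]
    y = targets[step * micro_batch:(step + 1) * micro_batch]
    # Only 8 samples' activations are in memory at a time.
    loss = (model(x) - y).pow(2).mean() / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # one update from all 32 samples' gradients
        optimizer.zero_grad()
        updates += 1

print(updates)  # a single weight update for the 32-sample effective batch
```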