PyTorch · ~20 mins

Gradient accumulation in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: Predict Output (intermediate, 2:00)
Output of gradient accumulation with batch size 2

Consider the following PyTorch code snippet that uses gradient accumulation with accumulation_steps=2. What will be the printed value of model.linear.weight.grad after the loop?

PyTorch
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1, bias=False)

model = SimpleModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = [torch.tensor([[1.0]]), torch.tensor([[2.0]])]
targets = [torch.tensor([[2.0]]), torch.tensor([[4.0]])]

accumulation_steps = 2
optimizer.zero_grad()

for i in range(2):
    output = model.linear(inputs[i])
    loss = (output - targets[i]).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

print(model.linear.weight.grad)
A. tensor([[1.]])
B. None
C. tensor([[0.]])
D. tensor([[0.5]])
💡 Hint

Remember that after optimizer.step() and optimizer.zero_grad(), gradients are reset.
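Before answering, it can help to verify the core mechanic the hint relies on: `backward()` adds into `.grad` rather than overwriting it. A minimal standalone check (separate from the quiz code; the weight value and input are made up for illustration):

```python
import torch
import torch.nn as nn

# A 1x1 linear layer with no bias, so out = w * x and d(out)/dw = x exactly.
layer = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    layer.weight.fill_(1.0)  # fix the weight so gradients are deterministic

x = torch.tensor([[3.0]])

layer(x).sum().backward()            # d(out)/dw = x = 3.0
first = layer.weight.grad.clone()    # tensor([[3.]])

layer(x).sum().backward()            # gradients accumulate: 3.0 + 3.0
second = layer.weight.grad.clone()   # tensor([[6.]])
```

This accumulation is exactly what makes the technique work: without an explicit `zero_grad()`, every `backward()` call keeps summing into `.grad`.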

Problem 2: 🧠 Conceptual (intermediate, 1:30)
Why use gradient accumulation?

Why do we use gradient accumulation in training deep learning models?

A. To avoid using an optimizer during training.
B. To reduce the number of model parameters by accumulating gradients.
C. To speed up training by skipping backward passes.
D. To simulate a larger batch size than fits in memory by accumulating gradients over smaller batches before updating weights.
💡 Hint

Think about memory limits and batch sizes.
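For reference, the canonical accumulation loop looks like this. A minimal sketch with made-up data (8 micro-batches of 8 samples, accumulated in groups of 4 to mimic an effective batch of 32):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4

# Hypothetical data: 8 micro-batches of 8 samples each
micro_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = nn.functional.mse_loss(model(x), y)
    # Divide by accumulation_steps so the summed gradient matches the
    # mean gradient over the full effective batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective batch of 32
        optimizer.zero_grad()
```

Only the micro-batch activations live in memory at once, yet each weight update reflects gradients from the full effective batch.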

Problem 3: Hyperparameter (advanced, 1:30)
Effect of accumulation steps on learning rate

If you increase the number of gradient accumulation steps from 1 to 4 without changing the learning rate, how does the effective learning rate per sample change?

A. The effective learning rate per sample is 4 times smaller.
B. The effective learning rate per sample is 4 times larger.
C. The effective learning rate per sample stays the same.
D. The effective learning rate per sample becomes zero.
💡 Hint

Think about how many updates happen per sample when accumulating gradients.
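One way to follow the hint is to count optimizer updates per fixed number of samples. A back-of-the-envelope sketch (all numbers are hypothetical):

```python
# With a fixed learning rate, accumulation changes how often the optimizer
# steps relative to the number of samples seen.
lr = 0.1
samples = 32
micro_batch_size = 8

for accumulation_steps in (1, 4):
    samples_per_update = micro_batch_size * accumulation_steps
    updates = samples // samples_per_update
    print(f"accumulation_steps={accumulation_steps}: "
          f"{updates} update(s), {samples_per_update} samples per update, lr={lr}")
```

Whether the per-sample effect cancels out or shrinks then depends on whether each micro-batch loss is scaled by `1 / accumulation_steps` before `backward()`, since that scaling controls the magnitude of each (now rarer) update.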

Problem 4: 🔧 Debug (advanced, 2:00)
Bug in gradient accumulation code

What error will the following PyTorch code raise?

import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 2
optimizer.zero_grad()

for i in range(3):
    x = torch.tensor([[float(i + 1)]])
    y = torch.tensor([[2.0 * (i + 1)]])
    output = model(x)
    loss = (output - y).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

optimizer.step()
A. RuntimeError: optimizer.step() called twice without backward()
B. RuntimeError: optimizer.step() called without zero_grad()
C. RuntimeError: Trying to backward through the graph a second time
D. No error, code runs fine
💡 Hint

Check how many times optimizer.step() is called relative to loss.backward().
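Whatever you conclude about the quiz code, the underlying situation (batch count not divisible by `accumulation_steps`) comes up often in practice. A common defensive pattern is to flush the trailing partial window explicitly, as in this sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 2
num_batches = 3  # deliberately NOT a multiple of accumulation_steps
optimizer.zero_grad()

for i in range(num_batches):
    x = torch.tensor([[float(i + 1)]])
    y = torch.tensor([[2.0 * (i + 1)]])
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Flush gradients from the trailing partial window, if any, instead of
# unconditionally calling optimizer.step() after the loop.
if num_batches % accumulation_steps != 0:
    optimizer.step()
    optimizer.zero_grad()
```

The guard makes the intent explicit: the final `step()` only runs when leftover gradients actually exist.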

Problem 5: Model Choice (expert, 2:30)
Choosing batch size and accumulation steps for limited GPU memory

You want to train a large transformer model but your GPU memory only fits batch size 8. You want an effective batch size of 32. Which approach is best?

A. Use batch size 4 with accumulation steps = 8 to simulate batch size 32.
B. Use batch size 8 with gradient accumulation steps = 4 to simulate batch size 32.
C. Use batch size 8 and update weights after every batch without accumulation.
D. Use batch size 32 and reduce model size to fit GPU memory.
💡 Hint

Think about memory limits and how accumulation simulates larger batch sizes.
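The arithmetic behind this kind of decision is simple enough to sanity-check in a couple of lines (values taken from the problem statement):

```python
# Effective batch size = micro-batch size that fits on the GPU
# multiplied by the number of accumulation steps.
gpu_batch_size = 8            # largest micro-batch that fits in GPU memory
target_effective_batch = 32   # batch size you want to train with

accumulation_steps = target_effective_batch // gpu_batch_size
effective_batch = gpu_batch_size * accumulation_steps
print(accumulation_steps, effective_batch)
```

Using the largest micro-batch that fits keeps the GPU well utilized; accumulation then supplies the rest of the effective batch without extra memory.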