Consider the following PyTorch code snippet that uses gradient accumulation with accumulation_steps=2. What will be the printed value of model.linear.weight.grad after the loop?
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1, bias=False)

model = SimpleModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = [torch.tensor([[1.0]]), torch.tensor([[2.0]])]
targets = [torch.tensor([[2.0]]), torch.tensor([[4.0]])]
accumulation_steps = 2

optimizer.zero_grad()
for i in range(2):
    output = model.linear(inputs[i])
    loss = (output - targets[i]).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
print(model.linear.weight.grad)
Remember that optimizer.step() followed by optimizer.zero_grad() clears the accumulated gradients, and check what optimizer.zero_grad() leaves in .grad by default.
After the second iteration, optimizer.step() and optimizer.zero_grad() are called, so the accumulated gradient is cleared. With the default zero_grad(set_to_none=True) in PyTorch 2.0 and later, model.linear.weight.grad prints None; with zero_grad(set_to_none=False), it prints a zero tensor, tensor([[0.]]).
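The two behaviors can be checked directly. This is a minimal sketch, assuming PyTorch 2.x, where zero_grad defaults to set_to_none=True:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1, bias=False)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = (model(torch.tensor([[1.0]])) - torch.tensor([[2.0]])).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()                    # default set_to_none=True in PyTorch >= 2.0
print(model.weight.grad)           # None

loss = (model(torch.tensor([[1.0]])) - torch.tensor([[2.0]])).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad(set_to_none=False)   # keep a zero-filled tensor instead
print(model.weight.grad)           # tensor([[0.]])
```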
Why do we use gradient accumulation in training deep learning models?
Think about memory limits and batch sizes.
Gradient accumulation allows training with a larger effective batch size by summing gradients over multiple smaller micro-batches before updating the model weights, which helps when GPU memory cannot hold the full batch.
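The idea above can be sketched in a few lines. This is an illustrative toy example (the data and hyperparameters are placeholders); dividing each loss by accumulation_steps makes the summed gradient match the average gradient of one large batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 2
micro_batches = [(torch.tensor([[1.0]]), torch.tensor([[2.0]])),
                 (torch.tensor([[2.0]]), torch.tensor([[4.0]]))]

optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = (model(x) - y).pow(2).mean()
    # Scale so the accumulated gradient equals the large-batch average.
    (loss / accumulation_steps).backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()   # one weight update per accumulation_steps micro-batches
        optimizer.zero_grad()
```

With this scaling, the single update is numerically identical to one SGD step on the concatenated batch of both samples.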
If you increase the number of gradient accumulation steps from 1 to 4 without changing the learning rate, what is the effective change in the learning rate per sample?
Think about how many updates happen per sample when accumulating gradients.
With 4 accumulation steps, the optimizer updates the weights once every 4 micro-batches instead of every batch, so each sample participates in 4 times fewer updates, effectively reducing the per-sample learning rate by a factor of 4. Note that this assumes each micro-batch loss is divided by the number of accumulation steps (the usual convention, so the summed gradient matches the large-batch average); without that scaling, the gradients simply sum and the per-update step grows by 4 instead.
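The scaling convention mentioned above is easy to verify numerically. This toy sketch (values are arbitrary) accumulates four identical micro-batches with and without dividing the loss by accumulation_steps:

```python
import torch

w = torch.tensor([[0.5]], requires_grad=True)
x, y = torch.tensor([[1.0]]), torch.tensor([[2.0]])
accumulation_steps = 4

# Unscaled losses: gradients simply sum across the 4 backward calls.
for _ in range(accumulation_steps):
    ((w * x - y).pow(2).mean()).backward()
unscaled = w.grad.clone()
w.grad = None

# Losses divided by accumulation_steps: the sum matches one micro-batch's
# average gradient (here all micro-batches are identical).
for _ in range(accumulation_steps):
    ((w * x - y).pow(2).mean() / accumulation_steps).backward()
scaled = w.grad.clone()

# The unscaled accumulated gradient is 4x the scaled one, so an unscaled
# update behaves like a 4x larger learning rate per update.
print(unscaled, scaled)
```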
What error, if any, will the following PyTorch code raise?
import torch
import torch.nn as nn
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 2
optimizer.zero_grad()
for i in range(3):
    x = torch.tensor([[float(i + 1)]])
    y = torch.tensor([[2.0 * (i + 1)]])
    output = model(x)
    loss = (output - y).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
optimizer.step()
Check how many times optimizer.step() is called relative to loss.backward().
None: the code runs without raising an error, because optimizer.step() does not check whether a backward pass preceded it (parameters whose .grad is None are simply skipped). The actual problem is logical. After i = 2, loss.backward() has accumulated a gradient from only one leftover micro-batch, and the trailing optimizer.step() applies that partial, unscaled gradient as a full update.
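One common fix, shown here as a sketch, is to flush the leftover accumulated gradient only when the number of batches is not divisible by accumulation_steps:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 2
num_batches = 3

optimizer.zero_grad()
for i in range(num_batches):
    x = torch.tensor([[float(i + 1)]])
    y = torch.tensor([[2.0 * (i + 1)]])
    loss = (model(x) - y).pow(2).mean()
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Flush the partial accumulation from the trailing odd batch, if any,
# instead of calling optimizer.step() unconditionally.
if num_batches % accumulation_steps != 0:
    optimizer.step()
    optimizer.zero_grad()
```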
You want to train a large transformer model but your GPU memory only fits batch size 8. You want an effective batch size of 32. Which approach is best?
Think about memory limits and how accumulation simulates larger batch sizes.
Using batch size 8 with 4 accumulation steps simulates batch size 32 without exceeding GPU memory. Batch size 32 alone may not fit, and batch size 4 with 8 accumulation steps is less efficient: smaller micro-batches underutilize the GPU and require more forward/backward passes per weight update.
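The recommended setup can be sketched as follows. The model size, data, and learning rate here are toy placeholders; only the micro-batch/accumulation arithmetic matters:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
micro_batch, accumulation_steps = 8, 4   # 8 * 4 = effective batch size 32

data = torch.randn(32, 16)
targets = torch.randn(32, 1)

optimizer.zero_grad()
updates = 0
for step in range(accumulation_steps):
    x = data[step * micro_batch:(step + 1) * micro_batch]
    y = targets[step * micro_batch:(step + 1) * micro_batch]
    # Only 8 samples' activations are in memory at a time.
    loss = (model(x) - y).pow(2).mean() / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # one update from all 32 samples' gradients
        optimizer.zero_grad()
        updates += 1

print(updates)  # a single weight update for the 32-sample effective batch
```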