PyTorch · ~20 mins

Gradient accumulation and zeroing in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output · intermediate
Output of gradient accumulation with zeroing
Consider the following PyTorch code snippet that performs gradient accumulation over two mini-batches before updating the model parameters. What will be the value of model.linear.weight.grad after the second backward call and before optimizer.step()?
PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1, bias=False)

model = SimpleModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Input and target batches
inputs1 = torch.tensor([[1.0]])
targets1 = torch.tensor([[2.0]])
inputs2 = torch.tensor([[2.0]])
targets2 = torch.tensor([[4.0]])

criterion = nn.MSELoss()

# First mini-batch
outputs1 = model(inputs1)
loss1 = criterion(outputs1, targets1)
loss1.backward()  # Gradients computed and accumulated

# Second mini-batch
outputs2 = model(inputs2)
loss2 = criterion(outputs2, targets2)
loss2.backward()  # Gradients accumulated again

# What is model.linear.weight.grad here?
A. A tensor approximately equal to [[-15.0]], representing triple the gradient from one batch
B. A tensor approximately equal to [[-10.0]], representing the sum of gradients from both batches
C. None, because gradients are zeroed automatically after each backward call
D. A tensor approximately equal to [[-5.0]], representing only the gradient from the second batch
💡 Hint
Remember that calling backward accumulates gradients unless you zero them explicitly.
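To see the accumulation rule in isolation, here is a toy check that is separate from the quiz snippet: the weight is pinned to 1.0 and the data are different numbers (both assumptions made for easy hand-verification, not the quiz's setup or answer):

```python
import torch
import torch.nn as nn

# Toy check of gradient accumulation, separate from the quiz snippet.
# The weight is pinned to 1.0 (an assumption) so each gradient is easy to verify by hand.
lin = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    lin.weight.fill_(1.0)
criterion = nn.MSELoss()

# First backward: d/dw (w*1 - 0)^2 at w=1 is 2.
criterion(lin(torch.tensor([[1.0]])), torch.tensor([[0.0]])).backward()
g1 = lin.weight.grad.item()

# Second backward without zeroing: its own gradient, d/dw (w*2 - 0)^2 at w=1, is 8.
criterion(lin(torch.tensor([[2.0]])), torch.tensor([[0.0]])).backward()
g2 = lin.weight.grad.item()

print(g1, g2)  # g2 equals g1 plus the second batch's own gradient
```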
🧠 Conceptual · intermediate
Why zero gradients before backward?
Why do we usually call optimizer.zero_grad() before calling loss.backward() in a training loop?
A. To clear old gradients so that new gradients are not accumulated on top of them
B. To reset model weights to zero before each update
C. To initialize the optimizer's learning rate
D. To save memory by deleting the computation graph
💡 Hint
Think about what happens if you don't clear gradients before backward.
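As a sketch of the usual ordering, here is a minimal loop; the model and data are toy stand-ins, not taken from any quiz question:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Minimal sketch of the standard ordering; model and data are toy stand-ins.
torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
x, y = torch.tensor([[1.0]]), torch.tensor([[2.0]])

losses = []
for step in range(10):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = criterion(model(x), y)
    loss.backward()              # gradients now reflect this batch only
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])     # loss shrinks because each step uses fresh gradients
```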
Hyperparameter · advanced
Choosing accumulation steps for gradient accumulation
You want to simulate a batch size of 64 but your GPU can only handle batch size 16. You decide to use gradient accumulation with accumulation_steps. Which value of accumulation_steps correctly simulates the larger batch size?
A. 8, because 16 / 8 = 2
B. 2, because 16 + 2 = 18
C. 4, because 16 * 4 = 64
D. 1, because no accumulation is needed
💡 Hint
Think about how many small batches you need to accumulate to reach the target batch size.
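The arithmetic can be sketched as a short accumulation loop; the model, random data, and learning rate here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Sketch: the number of micro-batches to accumulate is target size / micro size.
# The model, data, and learning rate are illustrative assumptions.
target_batch, micro_batch = 64, 16
accumulation_steps = target_batch // micro_batch   # 64 // 16 = 4

model = nn.Linear(4, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

optimizer.zero_grad()
for _ in range(accumulation_steps):
    x = torch.randn(micro_batch, 4)
    y = torch.randn(micro_batch, 1)
    # Divide so the accumulated gradient matches the average over all 64 samples.
    (criterion(model(x), y) / accumulation_steps).backward()
optimizer.step()   # one update, as if the batch size were 64

print(accumulation_steps)
```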
🔧 Debug · advanced
Why does the model not train when zero_grad is missing?
You wrote this training loop but the model's loss does not decrease over epochs. What is the most likely cause?
PyTorch
for epoch in range(5):
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # Missing optimizer.zero_grad()
A. The optimizer.step() is called before loss.backward(), causing errors
B. The model weights are reset to zero every batch, so no learning happens
C. The loss.backward() is not called, so no gradients are computed
D. Gradients accumulate every batch, causing optimizer.step() to apply huge updates, destabilizing training
💡 Hint
What happens if gradients are never cleared?
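One way to repair a loop like the one above can be sketched as follows; the model, data, and hyperparameters are toy stand-ins, not the quiz's actual setup:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Repaired loop sketch; model, data, and hyperparameters are toy stand-ins.
torch.manual_seed(0)
model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()
xs = torch.randn(64, 2)
ys = xs.sum(dim=1, keepdim=True)                     # a learnable linear target
dataloader = DataLoader(TensorDataset(xs, ys), batch_size=16)

with torch.no_grad():
    loss_before = criterion(model(xs), ys).item()

for epoch in range(5):
    for inputs, targets in dataloader:
        optimizer.zero_grad()                        # the missing line
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    loss_after = criterion(model(xs), ys).item()
print(loss_before, loss_after)  # training now makes progress
```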
Metrics · expert
Effect of gradient accumulation on training metrics
You train a model with gradient accumulation over 4 steps and batch size 16, simulating batch size 64. Which of the following statements about training loss and accuracy metrics logged per optimizer step is true?
A. Loss and accuracy logged per optimizer step correspond to the combined effect of 64 samples, so they are smoother and more stable
B. Loss and accuracy logged per optimizer step correspond to only 16 samples, so they are noisier
C. Loss and accuracy do not change because accumulation does not affect metrics
D. Loss and accuracy are invalid because gradients are accumulated
💡 Hint
Think about what batch size the optimizer step represents when using accumulation.
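A sketch of per-optimizer-step logging under accumulation; the model, random data, and hyperparameters here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Sketch: log once per optimizer step, averaging the micro-batch losses,
# so the logged value reflects the full simulated batch of 64 samples.
# Model, data, and hyperparameters are illustrative assumptions.
accumulation_steps, micro_batch = 4, 16
model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

running_loss = 0.0
optimizer.zero_grad()
for _ in range(accumulation_steps):
    x = torch.randn(micro_batch, 3)
    y = torch.randn(micro_batch, 1)
    loss = criterion(model(x), y)
    (loss / accumulation_steps).backward()
    running_loss += loss.item()
optimizer.step()

step_loss = running_loss / accumulation_steps   # one smoother number per update
print(step_loss)
```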