What is the main benefit of using Automatic Mixed Precision (AMP) in PyTorch training?
Think about how using smaller number formats affects speed and memory.
AMP runs eligible ops in float16 where it is numerically safe, reducing memory use and speeding up training with little or no loss of accuracy.
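A quick sketch of the effect (using CPU bfloat16 autocast here so it runs without a GPU; on CUDA the same mechanism uses float16):

```python
import torch

# autocast runs eligible ops (e.g. matmul) in a lower-precision dtype.
# CPU autocast uses bfloat16; CUDA autocast uses float16 the same way.
a = torch.randn(4, 4)
b = torch.randn(4, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)  # torch.bfloat16: half the bytes per element of float32
```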
What will be the printed loss value type after this AMP training step?
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(2, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()
inputs = torch.tensor([[1.0, 2.0]], device='cuda')
target = torch.tensor([[1.0]], device='cuda')

optimizer.zero_grad()
with autocast():
    output = model(inputs)
    loss = torch.nn.functional.mse_loss(output, target)
print(type(loss))
AMP uses float16 for some ops but loss is usually float32 for stability.
type(loss) prints <class 'torch.Tensor'> (type() reports the class, not the dtype); the tensor's dtype is torch.float32, because loss ops such as mse_loss are autocast to float32 for numerical stability even inside the autocast region.
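A CPU-friendly sketch of this behavior (bfloat16 autocast on CPU stands in for float16 on CUDA; mse_loss is on the float32 list for both backends):

```python
import torch

# The linear layer's output is low precision, but the loss op is
# autocast back to float32 for numerical stability.
model = torch.nn.Linear(2, 1)
inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[1.0]])

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(inputs)
    loss = torch.nn.functional.mse_loss(output, target)

print(type(loss))    # the class: <class 'torch.Tensor'>
print(output.dtype)  # torch.bfloat16
print(loss.dtype)    # torch.float32
```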
Which part of a model should NOT be wrapped inside autocast() for AMP training?
Consider which operations benefit from mixed precision and which do not.
The backward pass and the optimizer step should not be inside autocast(); autocast wraps only the forward pass and loss computation, while the optimizer updates the master weights in full float32.
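A minimal sketch of the correct placement (CPU bfloat16 autocast here, so it runs without a GPU; bfloat16 also lets us skip GradScaler, which float16 on CUDA would need):

```python
import torch

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[1.0]])

optimizer.zero_grad()
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(inputs)                               # inside: forward
    loss = torch.nn.functional.mse_loss(output, target)  # inside: loss

loss.backward()    # outside: backward pass
optimizer.step()   # outside: weight update on float32 master weights

print(model.weight.dtype)  # torch.float32
```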
What is the role of GradScaler in PyTorch AMP training?
Think about why gradients might vanish when using float16.
GradScaler multiplies loss by a scale factor to keep gradients in a safe range, avoiding underflow in float16.
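What GradScaler does can be sketched by hand with plain CPU tensors (65536 matches GradScaler's default initial scale; the actual class also skips steps when gradients overflow):

```python
import torch

# Scale the loss before backward so tiny gradients survive float16,
# then divide ("unscale") the gradients before the optimizer step.
scale = 65536.0  # GradScaler's default initial scale

w = torch.tensor([1.0], requires_grad=True)
loss = (w * 1e-5) ** 2           # produces a very small gradient, ~2e-10

(loss * scale).backward()        # scaled backward: grad = scale * true grad
w.grad.div_(scale)               # unscale before updating weights

print(w.grad)  # back to the true gradient, ~2e-10
```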
Given this AMP training snippet, what error will occur and why?
import torch
from torch.cuda.amp import autocast, GradScaler
model = torch.nn.Linear(2, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()
inputs = torch.tensor([[1.0, 2.0]], device='cuda')
target = torch.tensor([[1.0]], device='cuda')
optimizer.zero_grad()
with autocast():
    output = model(inputs)
    loss = torch.nn.functional.mse_loss(output, target)
loss.backward()
scaler.step(optimizer)
scaler.update()
Check how gradients are computed when using GradScaler.
scaler.step(optimizer) fails because scaler.scale() was never called, so the scaler has no scale factor to unscale gradients with. When using GradScaler, loss.backward() must be replaced by scaler.scale(loss).backward() so the loss (and thus the gradients) is scaled properly.
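A corrected version of the snippet, guarded with enabled=torch.cuda.is_available() so it also runs (as plain float32 training, with autocast and GradScaler acting as no-ops) on a CPU-only machine:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(2, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=use_cuda)
inputs = torch.tensor([[1.0, 2.0]], device=device)
target = torch.tensor([[1.0]], device=device)

optimizer.zero_grad()
with autocast(enabled=use_cuda):
    output = model(inputs)
    loss = torch.nn.functional.mse_loss(output, target)

scaler.scale(loss).backward()  # scale the loss, then backprop
scaler.step(optimizer)         # unscales grads, skips step on inf/nan
scaler.update()                # adjusts the scale factor for next iteration
```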