Imagine you are training a neural network and notice the training loss suddenly spikes or the model weights become very large. Why would applying gradient clipping help in this situation?
Think about what happens when gradients become very large during backpropagation.
Gradient clipping caps the magnitude of gradients during training, typically by rescaling them whenever their norm exceeds a chosen threshold. This prevents very large parameter updates that can make the model unstable or cause the loss to diverge. It is especially useful in recurrent neural networks and very deep networks, where exploding gradients are common.
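As a minimal sketch (the model, data, and hyperparameters here are made up for illustration), clipping is applied between `loss.backward()` and `optimizer.step()`:

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Hypothetical tiny setup: one linear layer trained with SGD.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()
# Rescale all gradients in place so their combined L2 norm is at most 1.0.
clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

The key point is the ordering: gradients must already exist (after `backward()`) but must not yet have been consumed (before `step()`).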
What will be the value of clipped_norm after running this PyTorch code?
import torch
from torch.nn.utils import clip_grad_norm_

model_params = [torch.nn.Parameter(torch.tensor([3.0, 4.0], requires_grad=True))]
for p in model_params:
    p.grad = torch.tensor([6.0, 8.0])

clipped_norm = clip_grad_norm_(model_params, max_norm=5.0)
print(round(clipped_norm.item(), 2))
Calculate the norm of the original gradients before clipping.
The original gradient vector is [6, 8]. Its L2 norm is sqrt(6^2 + 8^2) = 10. The function returns the total norm computed before clipping, so the printed value is 10.0. As a side effect, the stored gradient is rescaled in place by max_norm / total_norm = 5 / 10 = 0.5, leaving it at approximately [3, 4].
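This can be verified directly; the snippet below (a standalone check, not part of the original question) confirms both the return value and the in-place rescaling:

```python
import torch
from torch.nn.utils import clip_grad_norm_

p = torch.nn.Parameter(torch.tensor([3.0, 4.0]))
p.grad = torch.tensor([6.0, 8.0])  # L2 norm: sqrt(36 + 64) = 10

returned = clip_grad_norm_([p], max_norm=5.0)
# `returned` is the pre-clip total norm, 10.0.
# The gradient itself is scaled by 5 / 10 = 0.5, becoming ~[3.0, 4.0],
# which has the target norm of 5.0.
```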
You are training two models: a shallow feedforward network and a deep recurrent neural network (RNN). Which model benefits more from gradient clipping and why?
Consider which model type is more likely to have exploding gradients.
Deep RNNs often suffer from exploding gradients due to repeated multiplication of gradients through many time steps. Gradient clipping helps stabilize training in such models. Shallow feedforward networks usually do not have this problem.
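The repeated-multiplication effect can be sketched with a toy calculation (the weight value and step count below are arbitrary, chosen only to illustrate the growth):

```python
# Backprop through T time steps multiplies the gradient by (roughly)
# the recurrent weight once per step; any magnitude above 1 compounds
# exponentially.
w = 1.5        # hypothetical recurrent weight magnitude
grad = 1.0
for t in range(50):   # 50 time steps
    grad *= w

print(grad)  # ~1.5**50, on the order of 6e8
```

A shallow feedforward network has only a handful of such multiplications, so the same weight magnitudes do not compound into anything extreme.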
In PyTorch's clip_grad_norm_, what happens if you set max_norm to a very small value like 0.1 during training?
Think about what happens when gradients are clipped to a very small norm.
If max_norm is very small, any gradient whose norm exceeds it is scaled down aggressively, so every weight update becomes tiny. No error is raised, but training can slow to a crawl or stall entirely because the effective step size is far smaller than the learning rate suggests.
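A quick sketch of the effect (values chosen for illustration): with max_norm=0.1, a gradient of norm 5 is shrunk by a factor of 50 before the optimizer ever sees it.

```python
import torch
from torch.nn.utils import clip_grad_norm_

p = torch.nn.Parameter(torch.tensor([1.0, 1.0]))
p.grad = torch.tensor([3.0, 4.0])   # L2 norm is 5.0

clip_grad_norm_([p], max_norm=0.1)
# Gradient is scaled by 0.1 / 5 = 0.02, leaving a norm of ~0.1.
# With lr=0.1, the resulting weight update has magnitude ~0.01.
```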
Consider this PyTorch training loop snippet. What error, if any, will it raise, and why?
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.tensor([[1.0, 2.0]])
targets = torch.tensor([[1.0]])

optimizer.zero_grad()
outputs = model(inputs)
loss = torch.nn.functional.mse_loss(outputs, targets)
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Next iteration without zero_grad
outputs = model(inputs)
loss = torch.nn.functional.mse_loss(outputs, targets)
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
Look at the second backward call without clearing gradients.
No RuntimeError is raised because each loss.backward() call builds and backpropagates through a separate computational graph. However, omitting optimizer.zero_grad() before the second loss.backward() causes the new gradients to accumulate onto the stale gradients left over from the first iteration (after optimizer.step(), which does not clear gradients). This leads to incorrect parameter updates and is a common training loop bug.
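A corrected version of the loop simply clears the gradients at the start of each iteration; this sketch reuses the same model and data as the snippet above:

```python
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.tensor([[1.0, 2.0]])
targets = torch.tensor([[1.0]])

for _ in range(2):
    optimizer.zero_grad()          # clear stale gradients from the last step
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

With zero_grad in place, each `backward()` writes fresh gradients instead of adding onto the previous iteration's, so clipping and the optimizer step operate on the correct values.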