Consider the following PyTorch code that performs a backward pass and then prints the gradient of a tensor. What will be printed?
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x * 3
z = y ** 2
z.backward()
print(x.grad)
x.grad.zero_()
print(x.grad)
Remember that backward() computes gradients and zero_() sets them to zero in-place.
Since y = 3x and z = y^2 = 9x^2, the chain rule gives dz/dx = 2y * dy/dx = 2*(3x)*3 = 18x, which is 18*2 = 36 at x = 2. The first print therefore shows the gradient tensor, tensor([36.]). zero_() then sets the gradient to zero in place, so the second print shows tensor([0.]).
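The hand-derived gradient can be checked against autograd directly; this is an illustrative sketch, not part of the original question:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
z = (x * 3) ** 2            # z = 9x^2, so dz/dx = 18x
z.backward()

analytic = 18 * x.detach()  # hand-derived gradient, 18x
print(x.grad)               # tensor([36.])
assert torch.allclose(x.grad, analytic)

x.grad.zero_()              # in-place zeroing
print(x.grad)               # tensor([0.])
```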
In a typical training loop, which method should be called to clear gradients before computing new ones?
Think about which object manages the parameters and their gradients during optimization.
The optimizer.zero_grad() method clears the gradients of all optimized tensors. This is necessary before calling loss.backward() to avoid gradient accumulation from previous steps.
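A minimal sketch of where zero_grad() sits in a training step (the model, data, and learning rate here are illustrative placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

optimizer.zero_grad()   # clear stale gradients from the previous step
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()         # populate .grad on every parameter
optimizer.step()        # apply the update
```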
Consider a training loop where optimizer.zero_grad() is never called. What is the effect on the gradients during training?
Think about how PyTorch accumulates gradients by default.
PyTorch accumulates gradients on each backward call. If you don't zero them, gradients from previous steps add up, leading to larger updates than expected.
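This accumulation is easy to observe in isolation; a small sketch (the tensor values are arbitrary):

```python
import torch

x = torch.tensor([1.0], requires_grad=True)

# Two backward passes without zeroing in between:
(x * 5).backward()
print(x.grad)    # tensor([5.])
(x * 5).backward()
print(x.grad)    # tensor([10.]) -- gradients are summed, not replaced
```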
Examine the code below. Why does it raise an error at x.grad.zero_()?
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.sum()
y.backward()
x.grad = None
x.grad.zero_()
What happens if you assign None to a variable and then call a method on it?
Setting x.grad = None discards the gradient tensor, so x.grad is no longer a tensor. Calling zero_() on it then raises AttributeError, because NoneType has no zero_ method.
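One defensive pattern is to guard in-place gradient operations behind a None check; a minimal sketch of that fix:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
x.sum().backward()

x.grad = None             # a valid way to drop the gradient tensor
if x.grad is not None:    # guard before in-place ops on .grad
    x.grad.zero_()
print(x.grad)             # None -- no error raised
```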
In mini-batch gradient descent, why must gradients be zeroed before processing each batch?
Consider how PyTorch handles gradients across multiple backward passes.
PyTorch accumulates gradients by default. Zeroing gradients before each batch ensures that only the current batch's gradients affect the update, preventing unintended accumulation.
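The per-batch placement described above can be sketched as follows (the model and synthetic batches are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Synthetic mini-batches of (inputs, targets):
batches = [(torch.randn(4, 2), torch.randn(4, 1)) for _ in range(3)]

for xb, yb in batches:
    opt.zero_grad()    # only this batch's gradients drive the update
    loss = nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    opt.step()
```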