Gradient clipping keeps training stable by limiting how large the gradients, and therefore the model's updates, can be. The key metrics to watch are the training loss and the gradient norm. If gradients get too large, the loss can jump or become NaN (not a number). Clipping keeps gradients in a safe range so the loss can decrease smoothly. Monitoring the gradient norm before and after clipping shows whether clipping is taking effect.
Gradient clipping in PyTorch - Model Metrics & Evaluation
Gradient clipping does not relate directly to classification metrics such as a confusion matrix. Instead, we look at gradient norms and loss values.
Epoch | Gradient Norm Before Clipping | Gradient Norm After Clipping | Training Loss
---------------------------------------------------------------
1 | 15.2 | 5.0 | 2.3
2 | 12.7 | 5.0 | 1.8
3 | 20.5 | 5.0 | 1.2
4 | 4.8 | 4.8 | 0.9
5 | 3.2 | 3.2 | 0.7
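In epochs 1 to 3 the gradient norm exceeds the threshold of 5.0 and is scaled down to it; in epochs 4 and 5 the norm is already below 5.0 and is left unchanged. Clipping keeps gradients from exploding (growing too large), which helps the loss go down steadily.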
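The table's "after clipping" column can be reproduced with the usual clip-by-norm rule: scale the gradient by min(1, max_norm / norm). The sketch below is plain Python arithmetic on the norms only, assuming a threshold of 5.0 as the table suggests; the function name `clip_by_norm` is made up for illustration.

```python
def clip_by_norm(grad_norm: float, max_norm: float) -> float:
    """Return the gradient norm after clip-by-norm rescaling."""
    # The scale factor is 1 when the norm is already within the threshold.
    scale = min(1.0, max_norm / grad_norm) if grad_norm > 0 else 1.0
    return grad_norm * scale

# "Before clipping" column from the table; the threshold of 5.0 is assumed.
norms_before = [15.2, 12.7, 20.5, 4.8, 3.2]
norms_after = [round(clip_by_norm(n, 5.0), 4) for n in norms_before]
print(norms_after)
```

Norms above the threshold land on it, and norms below it pass through unchanged.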
Think of gradient clipping like setting a speed limit for a car. Without a limit, the car (model updates) might speed dangerously (exploding gradients), causing crashes (training failure). But if the limit is too low, the car moves too slowly (small updates), and training takes forever or gets stuck.
So, the tradeoff is between too much clipping (slow learning) and too little clipping (unstable training). Finding the right clipping value balances fast learning and stable updates.
- Good: Gradient norms before clipping sometimes exceed the threshold, but after clipping they stay at or below it. Training loss decreases smoothly without sudden jumps or NaNs.
- Bad: Gradient norms explode to very large values, causing loss to jump or become NaN. Or clipping is too aggressive, gradients are always very small, and loss decreases very slowly or plateaus.
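One way to collect the before/after norms described above is sketched here. It relies on `clip_grad_norm_` returning the total norm measured before clipping; the model, data, threshold, and the helper `total_grad_norm` are illustrative assumptions rather than part of the original lesson.

```python
import torch
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)  # fixed seed so the run is repeatable
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(16, 4) * 10.0  # large inputs to provoke large gradients
targets = torch.randn(16, 1)
max_norm = 5.0  # illustrative threshold

def total_grad_norm(parameters) -> float:
    """L2 norm over all gradients, treated as one flat vector."""
    grads = [p.grad.flatten() for p in parameters if p.grad is not None]
    return torch.cat(grads).norm(2).item()

history = []
for step in range(5):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # clip_grad_norm_ returns the total norm measured before clipping.
    norm_before = clip_grad_norm_(model.parameters(), max_norm=max_norm).item()
    norm_after = total_grad_norm(model.parameters())
    optimizer.step()
    history.append((step, norm_before, norm_after, loss.item()))
    print(f"step {step}: before={norm_before:.2f} after={norm_after:.2f} loss={loss.item():.3f}")
```

If the "after" value never differs from the "before" value, the threshold is not being reached; if it always sits at the threshold, clipping may be too aggressive.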
- Ignoring gradient norms: Not monitoring gradient sizes can hide exploding gradients causing training failure.
- Clipping too early or too late: Adding clipping only after training has already become unstable wastes time; clipping too aggressively slows learning.
- Using the wrong clipping method: Clipping by value and clipping by norm have different effects. Norm clipping rescales the whole gradient and preserves its direction, so it is usually the better choice.
- Confusing loss spikes: Sudden loss jumps might be due to other bugs, not just gradients.
Your model's training loss jumps to NaN after a few steps. Gradient norms before clipping are very large (e.g., 100), but after clipping they are capped at 5. Is your gradient clipping working well? What should you do?
Answer: Clipping is limiting gradients to 5, but loss still becomes NaN, so clipping alone is not enough. You might need to lower the clipping threshold, reduce learning rate, or check for other bugs. Gradient clipping helps but does not fix all training issues.
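One of the "other bugs" worth ruling out is a gradient that is already NaN or infinite before clipping: rescaling cannot repair it. As a rough sketch, `clip_grad_norm_` accepts `error_if_nonfinite=True`, which raises instead of passing such gradients through. The NaN gradient here is written by hand purely to simulate a broken backward pass.

```python
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(2, 1)
# Simulate a broken backward pass by writing NaN gradients by hand.
for p in model.parameters():
    p.grad = torch.full_like(p, float("nan"))

try:
    # With error_if_nonfinite=True a NaN/inf total norm raises a RuntimeError.
    clip_grad_norm_(model.parameters(), max_norm=5.0, error_if_nonfinite=True)
    nonfinite_detected = False
except RuntimeError:
    nonfinite_detected = True

print("non-finite gradients detected:", nonfinite_detected)
```

If this check fires, the cause lies upstream of clipping (for example in the data, the loss, or the learning rate), so lowering the threshold alone will not help.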
Practice
Solution
Step 1: Understand gradient behavior during training
Gradients can sometimes become very large, causing unstable updates and training divergence.
Step 2: Role of gradient clipping
Gradient clipping limits the size of gradients to keep training stable and prevent exploding gradients.
Final Answer:
To prevent gradients from becoming too large and destabilizing training -> Option A
Quick Check:
Gradient clipping = prevent large gradients [OK]
- Thinking it changes learning rate
- Confusing with weight initialization
- Believing it reduces model size
Solution
Step 1: Recall PyTorch gradient clipping functions
PyTorch provides two main functions: clip_grad_norm_ and clip_grad_value_ in torch.nn.utils.
Step 2: Identify function for norm clipping
clip_grad_norm_ clips gradients based on their total norm, while clip_grad_value_ clips individual gradient values.
Final Answer:
torch.nn.utils.clip_grad_norm_ -> Option C
Quick Check:
Norm clipping function = clip_grad_norm_ [OK]
- Using clip_grad_value_ for norm clipping
- Assuming optimizer has clipping functions
- Using non-existent torch.clip_gradients
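The difference between the two functions is easiest to see on a hand-set gradient. This is a minimal sketch: the helper `param_with_grad` and the values [3, 4] are made up for illustration, while both clipping calls are the real `torch.nn.utils` functions.

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

def param_with_grad(values):
    """Hypothetical helper: a parameter whose gradient is set by hand."""
    p = torch.nn.Parameter(torch.zeros(len(values)))
    p.grad = torch.tensor(values)
    return p

# Same starting gradient [3, 4] (L2 norm 5) for both methods.
p_norm = param_with_grad([3.0, 4.0])
p_value = param_with_grad([3.0, 4.0])

clip_grad_norm_([p_norm], max_norm=1.0)      # rescales the whole vector
clip_grad_value_([p_value], clip_value=1.0)  # clamps each element to [-1, 1]

print(p_norm.grad)   # direction preserved, norm close to 1
print(p_value.grad)  # each element clamped independently
```

Norm clipping keeps the 3:4 ratio between the components, whereas value clipping flattens both to the same value and so changes the update direction.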
import torch
from torch.nn.utils import clip_grad_norm_

# Tiny model and a single training example
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[1.0]])

optimizer.zero_grad()
output = model(inputs)
loss = (output - target).pow(2).mean()
loss.backward()  # compute gradients
# Rescale gradients so their total L2 norm is at most 0.1
clip_grad_norm_(model.parameters(), max_norm=0.1)
# Print the gradient norm of each parameter (weight, then bias)
for p in model.parameters():
    print(p.grad.norm().item())
Solution
Step 1: Understand code flow and gradient clipping
Gradients are computed by loss.backward(), then clipped by clip_grad_norm_ with max_norm=0.1.
Step 2: Effect of clip_grad_norm_ on gradients
clip_grad_norm_ rescales gradients so their total norm across all parameters does not exceed 0.1. Each parameter's own gradient norm is at most the total norm, so every printed norm will be ≤ 0.1.
Final Answer:
All printed gradient norms will be less than or equal to 0.1 -> Option D
Quick Check:
clip_grad_norm_ limits gradient norm ≤ max_norm [OK]
- Calling clip_grad_norm_ before backward()
- Expecting gradients to be zero after clipping
- Thinking clipping increases gradient norms
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
clip_grad_norm_(model.parameters(), max_norm=1.0)
loss.backward()
optimizer.step()
Solution
Step 1: Check order of operations for gradient clipping
Gradients are created by loss.backward(), so clipping must happen after backward() to affect gradients.
Step 2: Identify mistake in code order
clip_grad_norm_ is called before loss.backward(), so gradients do not exist yet and clipping has no effect.
Final Answer:
clip_grad_norm_ is called before loss.backward(), so gradients are not clipped -> Option B
Quick Check:
Clip gradients after backward() [OK]
- Clipping before backward()
- Calling zero_grad() after step()
- Setting max_norm to zero
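The effect of the wrong order can be checked directly. In this small sketch the weights are zeroed and the data chosen by hand (both assumptions for repeatability), so the gradient norm after backward() is known to be well above the threshold.

```python
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(2, 1)
with torch.no_grad():  # fixed weights so the numbers are repeatable
    model.weight.zero_()
    model.bias.zero_()

inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[10.0]])

loss = (model(inputs) - target).pow(2).mean()

# Wrong order: no gradients exist yet, so there is nothing to clip.
norm_seen = clip_grad_norm_(model.parameters(), max_norm=1.0)
loss.backward()

# The gradients produced by backward() are left at full size.
unclipped = torch.cat([p.grad.flatten() for p in model.parameters()]).norm()
print(float(norm_seen), float(unclipped))
```

The clipping call reports a norm of zero because it ran on a model without gradients, and the norm measured after backward() is far larger than max_norm=1.0.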
Solution
Step 1: Understand correct gradient clipping sequence
Gradients are computed by loss.backward(), so clipping must happen after this step and before optimizer.step().
Step 2: Identify correct function and order
clip_grad_norm_ is used to clip by norm, suitable for RNNs to prevent exploding gradients. It must be called after backward() and before optimizer.step().
Final Answer:
After loss.backward(), call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5), then optimizer.step() -> Option A
Quick Check:
Clip gradients after backward(), before step() [OK]
- Clipping before backward()
- Clipping after optimizer.step()
- Using clip_grad_value_ incorrectly
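Put together, the order backward(), clip, step() might look like the following for a small recurrent model. This is only a sketch: the network sizes, random data, and learning rate are placeholder assumptions, and max_norm=5 follows the answer above.

```python
import torch
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=3, hidden_size=8, batch_first=True)
head = torch.nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

sequences = torch.randn(4, 20, 3)  # (batch, time steps, features), random stand-in data
targets = torch.randn(4, 1)

optimizer.zero_grad()
outputs, _ = rnn(sequences)
prediction = head(outputs[:, -1, :])  # predict from the last time step
loss = torch.nn.functional.mse_loss(prediction, targets)
loss.backward()                                     # 1. compute gradients
norm_before = clip_grad_norm_(params, max_norm=5)   # 2. clip them
optimizer.step()                                    # 3. apply the clipped update

# Gradients are still stored after step(), so the clipped norm can be inspected.
norm_after = torch.cat([p.grad.flatten() for p in params]).norm()
print(float(norm_before), float(norm_after))
```

Collecting the parameters of both modules into one list means the threshold applies to the total norm across the whole model rather than to each module separately.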
