Gradient clipping keeps training stable by capping the size of the gradients, which limits how far a single update can move the model's parameters. This prevents the large jumps that can make the loss diverge.
Gradient clipping in PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
# or
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
clip_grad_norm_ limits the total size (norm) of all gradients combined.
clip_grad_value_ limits each gradient value individually.
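The difference between the two functions is easiest to see on a gradient set by hand. The following is a minimal sketch, not part of the original lesson: the helper make_param and the gradient values [3.0, 4.0] are made up for illustration.

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_


def make_param():
    # Hypothetical parameter with a hand-set gradient (illustration only).
    p = torch.nn.Parameter(torch.zeros(2))
    p.grad = torch.tensor([3.0, 4.0])  # L2 norm of this gradient is 5.0
    return p


# Norm clipping: the whole gradient is rescaled, so its direction is kept.
p_norm = make_param()
clip_grad_norm_([p_norm], max_norm=1.0)
print(p_norm.grad)  # approximately [0.6, 0.8], norm close to 1.0

# Value clipping: each element is clamped to [-clip_value, clip_value].
p_value = make_param()
clip_grad_value_([p_value], clip_value=0.5)
print(p_value.grad)  # [0.5, 0.5], so the direction has changed
```

Norm clipping keeps the ratio between gradient components, while value clipping can change it. That is one reason to pick one over the other.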
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
This example shows how to clip gradients to a maximum norm of 1.0 during training. It prints the gradient norm before and after clipping to see the effect.
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)

    def forward(self, x):
        return self.linear(x)

# Create model, loss, optimizer
model = SimpleNet()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Dummy data
inputs = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
targets = torch.tensor([[1.0], [2.0]])

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)

# Backward pass
loss.backward()

# Before clipping: print gradient norm
total_norm = 0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm before clipping: {total_norm:.4f}")

# Clip gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# After clipping: print gradient norm
total_norm = 0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm after clipping: {total_norm:.4f}")

# Optimizer step
optimizer.step()
Clip gradients after calling loss.backward() and before calling optimizer.step().
Clipping helps prevent exploding gradients, where gradient magnitudes grow so large that training becomes unstable.
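The ordering described above can be sketched as a short training loop. This is a minimal illustration under assumptions: the toy data, learning rate, and step count are made up. It relies on clip_grad_norm_ returning the total gradient norm measured before clipping, which makes it handy for logging.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Hypothetical toy data; large inputs tend to produce large gradients.
inputs = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
targets = torch.tensor([[1.0], [2.0]])

max_norm = 1.0
norms_before = []
for step in range(5):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()  # 1. compute gradients
    total_norm = clip_grad_norm_(model.parameters(), max_norm)  # 2. clip
    optimizer.step()  # 3. update with the clipped gradients
    norms_before.append(float(total_norm))
    print(f"step {step}: loss={loss.item():.4f}, norm before clipping={float(total_norm):.4f}")
```

Logging the pre-clipping norm shows how often clipping is active, which can help when choosing max_norm.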
Gradient clipping keeps training stable by limiting how big gradients can get.
Use clip_grad_norm_ to limit total gradient size or clip_grad_value_ to limit individual values.
Always clip gradients after backward pass and before optimizer step.
Practice
Solution
Step 1: Understand gradient behavior during training
Gradients can sometimes become very large, causing unstable updates and training divergence.
Step 2: Role of gradient clipping
Gradient clipping limits the size of gradients to keep training stable and prevent exploding gradients.
Final Answer:
To prevent gradients from becoming too large and destabilizing training -> Option A
Quick Check:
Gradient clipping = prevent large gradients [OK]
- Thinking it changes learning rate
- Confusing with weight initialization
- Believing it reduces model size
Solution
Step 1: Recall PyTorch gradient clipping functions
PyTorch provides two main functions: clip_grad_norm_ and clip_grad_value_ in torch.nn.utils.
Step 2: Identify function for norm clipping
clip_grad_norm_ clips gradients based on their total norm, while clip_grad_value_ clips individual gradient values.
Final Answer:
torch.nn.utils.clip_grad_norm_ -> Option C
Quick Check:
Norm clipping function = clip_grad_norm_ [OK]
- Using clip_grad_value_ for norm clipping
- Assuming optimizer has clipping functions
- Using non-existent torch.clip_gradients
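The two functions named in the solution above can be written as update rules. This is a sketch for the default 2-norm. The symbols are introduced here for illustration: g_i is the gradient of the i-th parameter tensor, c is max_norm, v is clip_value, and \epsilon is a small constant in the denominator for numerical stability (to my understanding PyTorch uses one, but treat the exact value as an assumption).

```latex
% Norm clipping (clip_grad_norm_): one shared scale factor for all gradients
\[
\lVert g \rVert = \sqrt{\sum_i \lVert g_i \rVert_2^2},
\qquad
g_i \leftarrow g_i \cdot \min\!\left(1,\ \frac{c}{\lVert g \rVert + \epsilon}\right)
\]

% Value clipping (clip_grad_value_): each element j is clamped independently
\[
g_{i,j} \leftarrow \min\bigl(\max(g_{i,j},\, -v),\ v\bigr)
\]
```

When the total norm is already at or below c, the scale factor is 1 and the gradients are left unchanged.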
import torch
from torch.nn.utils import clip_grad_norm_
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[1.0]])
optimizer.zero_grad()
output = model(inputs)
loss = (output - target).pow(2).mean()
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=0.1)
for p in model.parameters():
    print(p.grad.norm().item())
Solution
Step 1: Understand code flow and gradient clipping
Gradients are computed by loss.backward(), then clipped by clip_grad_norm_ with max_norm=0.1.
Step 2: Effect of clip_grad_norm_ on gradients
clip_grad_norm_ rescales gradients so their total norm does not exceed 0.1, so printed norms will be ≤ 0.1.
Final Answer:
All printed gradient norms will be less than or equal to 0.1 -> Option D
Quick Check:
clip_grad_norm_ limits gradient norm ≤ max_norm [OK]
- Calling clip_grad_norm_ before backward()
- Expecting gradients to be zero after clipping
- Thinking clipping increases gradient norms
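The claim in this solution can be checked directly. The sketch below reuses the same setup as the question and adds a combined norm over all parameters. The variable names per_param and total_after are introduced here for illustration.

```python
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(2, 1)
inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[1.0]])

loss = (model(inputs) - target).pow(2).mean()
loss.backward()

max_norm = 0.1
# The return value is the total norm measured before clipping.
norm_before = clip_grad_norm_(model.parameters(), max_norm=max_norm)

# Norm of each parameter's gradient (weight and bias) after clipping.
per_param = [p.grad.norm().item() for p in model.parameters()]
# Combined L2 norm over all gradients after clipping.
total_after = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
).item()

print(f"total norm before clipping: {float(norm_before):.4f}")
print(f"per-parameter norms after clipping: {per_param}")
print(f"total norm after clipping: {total_after:.4f}")
```

Each per-parameter norm is at most the combined norm, which is why every printed value stays at or below max_norm.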
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
clip_grad_norm_(model.parameters(), max_norm=1.0)
loss.backward()
optimizer.step()
Solution
Step 1: Check order of operations for gradient clipping
Gradients are created by loss.backward(), so clipping must happen after backward() to affect gradients.
Step 2: Identify mistake in code order
clip_grad_norm_ is called before loss.backward(), so gradients do not exist yet and clipping has no effect.
Final Answer:
clip_grad_norm_ is called before loss.backward(), so gradients are not clipped -> Option B
Quick Check:
Clip gradients after backward() [OK]
- Clipping before backward()
- Calling zero_grad() after step()
- Setting max_norm to zero
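The effect of the ordering mistake can be observed by running both orders on identical copies of a model. This is a minimal sketch under assumptions: the fixed weights, the data values, and the helper grad_norm are made up for illustration, chosen so that the unclipped gradient is large.

```python
import copy

import torch
from torch.nn.utils import clip_grad_norm_


def grad_norm(model):
    # Total L2 norm over all parameter gradients (hypothetical helper).
    return torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters()])
    ).item()


base = torch.nn.Linear(2, 1)
with torch.no_grad():
    base.weight.fill_(1.0)  # fixed weights so the result is reproducible
    base.bias.fill_(0.0)
data = torch.tensor([[10.0, 20.0]])
target = torch.tensor([[1.0]])

# Wrong order: clip first, then backward. There is nothing to clip yet.
wrong = copy.deepcopy(base)
loss = (wrong(data) - target).pow(2).mean()
clip_grad_norm_(wrong.parameters(), max_norm=1.0)
loss.backward()
wrong_norm = grad_norm(wrong)

# Correct order: backward first, then clip.
right = copy.deepcopy(base)
loss = (right(data) - target).pow(2).mean()
loss.backward()
clip_grad_norm_(right.parameters(), max_norm=1.0)
right_norm = grad_norm(right)

print(f"clip before backward: {wrong_norm:.4f}")  # far above 1.0
print(f"clip after backward:  {right_norm:.4f}")  # at most 1.0
```

In the wrong order the call raises no error, so the mistake is easy to miss unless the gradient norm is inspected.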
Solution
Step 1: Understand correct gradient clipping sequence
Gradients are computed by loss.backward(), so clipping must happen after this step and before optimizer.step().
Step 2: Identify correct function and order
clip_grad_norm_ is used to clip by norm, suitable for RNNs to prevent exploding gradients. It must be called after backward() and before optimizer.step().
Final Answer:
After loss.backward(), call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5), then optimizer.step() -> Option A
Quick Check:
Clip gradients after backward(), before step() [OK]
- Clipping before backward()
- Clipping after optimizer.step()
- Using clip_grad_value_ incorrectly
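The sequence in the final answer can be sketched for a small recurrent model. The TinyRNN class, the tensor shapes, and the random data below are assumptions made for illustration. Only the backward, clip, step order and max_norm=5 come from the answer above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyRNN(nn.Module):
    # Hypothetical model: one RNN layer followed by a linear head.
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
        self.head = nn.Linear(8, 1)

    def forward(self, x):
        out, _ = self.rnn(x)             # out: (batch, seq_len, hidden)
        return self.head(out[:, -1, :])  # predict from the last time step


model = TinyRNN()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 50, 4)  # random batch of 16 sequences, 50 steps each
y = torch.randn(16, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                                                # gradients exist now
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)  # clip them
post_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
).item()
optimizer.step()                                               # apply the clipped update

print(f"gradient norm used for the update: {post_norm:.4f}")
```

The right value of max_norm depends on the model and data, so 5 here is just the value from the question, not a general recommendation.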
