PyTorch · ~5 mins

Gradient clipping in PyTorch

Introduction

Gradient clipping keeps training stable by capping very large gradients before the optimizer applies them. Without it, a single oversized update can push the model's weights far from a good solution and derail learning.

Gradient clipping is most useful:

When training deep neural networks that sometimes produce very large gradients.
When the training loss suddenly spikes or becomes unstable.
When using recurrent neural networks (RNNs), which are especially prone to exploding gradients.
When you want to keep training smooth and prevent the weights from changing too much in a single step.
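To see why this matters, here is a minimal sketch showing how a badly-scaled deep network can produce enormous gradients; the layer count and weight scale are made up purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deliberately badly-scaled deep stack to provoke exploding gradients
layers = nn.Sequential(*[nn.Linear(10, 10) for _ in range(20)])
for layer in layers:
    nn.init.normal_(layer.weight, std=2.0)  # oversized weights amplify signals

x = torch.randn(1, 10)
loss = layers(x).sum()
loss.backward()

# Total L2 norm across all parameter gradients
total_norm = torch.norm(torch.stack([p.grad.norm(2) for p in layers.parameters()]))
print(f"Gradient norm: {total_norm.item():.2e}")  # typically astronomically large
```

Each layer multiplies the signal by roughly the weight scale, so after 20 layers both activations and gradients blow up; clipping caps the damage from such a step.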
Syntax
PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# or

torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)

clip_grad_norm_ rescales all gradients together so that their combined L2 norm does not exceed max_norm, preserving the overall gradient direction.

clip_grad_value_ clamps each gradient element individually to the range [-clip_value, clip_value].

Examples
This clips the gradients so their total norm does not exceed 1.0.
PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
This clips each gradient value to be between -0.5 and 0.5.
PyTorch
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
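To see the difference concretely, the following sketch applies each function to identical hand-set gradients (the values are made up for illustration):

```python
import torch
import torch.nn as nn

# Two copies of the same parameter with identical, hand-set gradients
p_norm = nn.Parameter(torch.zeros(3))
p_value = nn.Parameter(torch.zeros(3))
p_norm.grad = torch.tensor([3.0, 4.0, 0.0])   # L2 norm = 5.0
p_value.grad = torch.tensor([3.0, 4.0, 0.0])

# Norm clipping rescales every element by the same factor (1.0 / 5.0 here),
# so the gradient's direction is preserved
torch.nn.utils.clip_grad_norm_([p_norm], max_norm=1.0)
print(p_norm.grad)   # tensor([0.6000, 0.8000, 0.0000])

# Value clipping clamps each element independently, which can change direction
torch.nn.utils.clip_grad_value_([p_value], clip_value=0.5)
print(p_value.grad)  # tensor([0.5000, 0.5000, 0.0000])
```

Note how norm clipping shrinks all elements proportionally, while value clipping flattens the two largest elements to the same value.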
Sample Model

This example shows how to clip gradients to a maximum norm of 1.0 during training. It prints the gradient norm before and after clipping so you can see the effect.

PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)

    def forward(self, x):
        return self.linear(x)

# Create model, loss, optimizer
model = SimpleNet()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Dummy data (deliberately large inputs so the gradient norm exceeds 1.0)
inputs = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
targets = torch.tensor([[1.0], [2.0]])

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)

# Backward pass
loss.backward()

# Helper: total L2 norm of all gradients
def grad_norm(parameters):
    norms = [p.grad.detach().norm(2) for p in parameters if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item()

# Before clipping: print gradient norm
print(f"Gradient norm before clipping: {grad_norm(model.parameters()):.4f}")

# Clip gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# After clipping: print gradient norm
print(f"Gradient norm after clipping: {grad_norm(model.parameters()):.4f}")

# Optimizer step
optimizer.step()
Important Notes

Gradient clipping should be done after calling loss.backward() and before optimizer.step().

Clipping helps prevent 'exploding gradients', a problem that can make training unstable.

clip_grad_norm_ returns the total gradient norm measured before clipping, which is convenient for logging.
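The ordering above can be sketched as a minimal training loop; the model, data, and hyperparameters here are placeholders for your own:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Tiny stand-in model and data; swap in your own
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(8, 2)
targets = torch.randn(8, 1)

for epoch in range(3):
    optimizer.zero_grad()                  # 1. reset old gradients
    loss = criterion(model(inputs), targets)
    loss.backward()                        # 2. compute gradients
    torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=1.0)  # 3. clip them
    optimizer.step()                       # 4. apply the (clipped) update
```

Clipping anywhere else in this sequence either sees no gradients (before backward) or arrives too late to affect the update (after step).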

Summary

Gradient clipping keeps training stable by limiting how big gradients can get.

Use clip_grad_norm_ to limit the total gradient norm, or clip_grad_value_ to clamp individual gradient values.

Always clip after the backward pass and before the optimizer step.