How to Use Mixed Precision Training in PyTorch for Faster Models
Use torch.cuda.amp.autocast() to run your model's forward pass in mixed precision and torch.cuda.amp.GradScaler() to scale gradients during backpropagation. Together they speed up training and reduce GPU memory usage while preserving model accuracy.

Syntax
Mixed precision training in PyTorch uses two main components:
- torch.cuda.amp.autocast(): automatically runs operations in mixed precision (float16 and float32).
- torch.cuda.amp.GradScaler(): scales the loss to prevent gradient underflow during backpropagation.
Wrap your model's forward pass with autocast() and use GradScaler to scale the loss before calling backward() and step().
```python
scaler = torch.cuda.amp.GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Example
This example shows a simple training loop using mixed precision on a dummy dataset with a basic neural network. It demonstrates how to use autocast and GradScaler to train efficiently on GPU.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler

# Simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Setup
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SimpleNet().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
scaler = GradScaler()

# Dummy data
inputs = torch.randn(100, 10).to(device)
targets = torch.randn(100, 1).to(device)

# Training loop
model.train()
for epoch in range(3):
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
```
Output (values will vary from run to run, since the data is random)
Epoch 1, Loss: 1.1234
Epoch 2, Loss: 0.9876
Epoch 3, Loss: 0.8765
Common Pitfalls
- Not using autocast() during the forward pass gives no speedup or memory benefit.
- Skipping GradScaler can let small float16 gradients underflow to zero, destabilizing training.
- Calling optimizer.step() directly instead of scaler.step(optimizer) applies the still-scaled gradients and skips the scaler's inf/NaN checks.
- torch.cuda.amp mixed precision targets CUDA devices; on CPU, autocast() and GradScaler fall back to no-ops with a warning, so you get no benefit. Always check torch.cuda.is_available() before relying on mixed precision.
```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Wrong way: no autocast and no scaler
optimizer.zero_grad()
outputs = model(inputs)           # runs entirely in float32
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()                  # gradients never scaled

# Right way:
scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
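To make the "check CUDA first" advice concrete, here is a minimal sketch of a device guard. It relies on the fact that both autocast() and GradScaler accept an enabled flag, so when no GPU is present the same loop degrades gracefully to plain float32 on CPU instead of raising. The model and data here are stand-ins, not part of the example above.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

use_amp = torch.cuda.is_available()      # enable AMP only when a GPU is present
device = 'cuda' if use_amp else 'cpu'

model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# enabled=False turns scaling into a cheap no-op, so this also runs on CPU
scaler = GradScaler(enabled=use_amp)

inputs = torch.randn(8, 10, device=device)
targets = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with autocast(enabled=use_amp):          # no-op context when use_amp is False
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```

With this pattern the training script needs no separate CPU code path; only the two enabled flags change.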
Quick Reference
| Step | Code Snippet | Purpose |
|---|---|---|
| 1 | `with torch.cuda.amp.autocast():` | Run the forward pass in mixed precision |
| 2 | `scaler = torch.cuda.amp.GradScaler()` | Create the gradient scaler |
| 3 | `scaler.scale(loss).backward()` | Scale the loss and backpropagate |
| 4 | `scaler.step(optimizer)` | Unscale gradients, then step the optimizer |
| 5 | `scaler.update()` | Adjust the scale factor for the next iteration |
Key Takeaways
Use torch.cuda.amp.autocast() to enable mixed precision during the forward pass.
Use torch.cuda.amp.GradScaler() to safely scale gradients and avoid underflow.
Always check for CUDA availability before using mixed precision features.
Replace loss.backward() with scaler.scale(loss).backward(), and optimizer.step() with scaler.step(optimizer).
Mixed precision training speeds up training and reduces GPU memory usage with little to no loss of accuracy.