How to Use Mixed Precision Training in PyTorch for Faster Models
Use torch.cuda.amp.autocast() to run your model's forward pass in mixed precision and torch.cuda.amp.GradScaler() to scale gradients during backpropagation. Together they speed up training and reduce GPU memory usage while preserving model accuracy.

Syntax
Mixed precision training in PyTorch uses two main components:
- torch.cuda.amp.autocast(): automatically runs operations in mixed precision (float16 and float32).
- torch.cuda.amp.GradScaler(): scales the loss to prevent gradient underflow during backpropagation.
Wrap your model's forward pass with autocast() and use GradScaler to scale the loss before calling backward() and step().
```python
scaler = torch.cuda.amp.GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Example
This example shows a simple training loop using mixed precision on a dummy dataset with a basic neural network. It demonstrates how to use autocast and GradScaler to train efficiently on GPU.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler

# Simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Setup
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SimpleNet().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
scaler = GradScaler()

# Dummy data
inputs = torch.randn(100, 10).to(device)
targets = torch.randn(100, 1).to(device)

# Training loop
model.train()
for epoch in range(3):
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
```
Output (values will vary from run to run, since the data is random)
Epoch 1, Loss: 1.1234
Epoch 2, Loss: 0.9876
Epoch 3, Loss: 0.8765
Common Pitfalls
- Not using autocast() during the forward pass gives no speedup or memory benefit.
- Skipping GradScaler can let small float16 gradients underflow to zero, destabilizing training.
- Calling optimizer.step() directly instead of scaler.step(optimizer) applies the still-scaled gradients and skips the scaler's inf/NaN checks.
- torch.cuda.amp mixed precision targets CUDA devices; on CPU, autocast() and GradScaler fall back to no-ops with a warning, so you get no benefit. Always check torch.cuda.is_available() before relying on mixed precision.
```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Wrong way: no autocast and no scaler
optimizer.zero_grad()
outputs = model(inputs)           # runs entirely in float32
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()                  # gradients never scaled

# Right way:
scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
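To make the "check CUDA first" advice concrete, here is a minimal sketch of a device guard. It relies on the fact that both autocast() and GradScaler accept an enabled flag, so when no GPU is present the same loop degrades gracefully to plain float32 on CPU instead of raising. The model and data here are stand-ins, not part of the example above.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

use_amp = torch.cuda.is_available()      # enable AMP only when a GPU is present
device = 'cuda' if use_amp else 'cpu'

model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# enabled=False turns scaling into a cheap no-op, so this also runs on CPU
scaler = GradScaler(enabled=use_amp)

inputs = torch.randn(8, 10, device=device)
targets = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with autocast(enabled=use_amp):          # no-op context when use_amp is False
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```

With this pattern the training script needs no separate CPU code path; only the two enabled flags change.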
Quick Reference
| Step | Code Snippet | Purpose |
|---|---|---|
| 1 | `with torch.cuda.amp.autocast():` | Run the forward pass in mixed precision |
| 2 | `scaler = torch.cuda.amp.GradScaler()` | Create the gradient scaler |
| 3 | `scaler.scale(loss).backward()` | Scale the loss and backpropagate |
| 4 | `scaler.step(optimizer)` | Unscale gradients, then step the optimizer |
| 5 | `scaler.update()` | Adjust the scale factor for the next iteration |
Key Takeaways
Use torch.cuda.amp.autocast() to enable mixed precision during the forward pass.
Use torch.cuda.amp.GradScaler() to safely scale gradients and avoid underflow.
Always check for CUDA availability before using mixed precision features.
Replace loss.backward() with scaler.scale(loss).backward(), and optimizer.step() with scaler.step(optimizer).
Mixed precision training speeds up training and reduces GPU memory usage with little to no loss of accuracy.