How to Use AdamW Optimizer in PyTorch: Syntax and Example
Create the optimizer with torch.optim.AdamW by passing your model parameters and a learning rate. After computing gradients, call optimizer.step() to update the model weights with AdamW's decoupled weight decay regularization.

Syntax
The AdamW optimizer in PyTorch is created by calling torch.optim.AdamW with the model parameters and optional settings like learning rate and weight decay.
- params: The model parameters to optimize, usually model.parameters().
- lr: Learning rate controlling the step size (default 0.001).
- weight_decay: Weight decay factor for regularization (default 0.01).
- betas: Coefficients for computing running averages of gradient and its square (default (0.9, 0.999)).
```python
optimizer = torch.optim.AdamW(params=model.parameters(), lr=0.001, weight_decay=0.01)
```
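AdamW also accepts parameter groups, so different parts of a model can use different settings. A common practice is to exclude biases (and normalization weights) from weight decay. Here is a minimal sketch of that pattern, assuming a small illustrative Sequential model:

```python
import torch
import torch.nn as nn

# Hypothetical model, just for illustration
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Split parameters: decay applies to weights, not to biases
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.001,
)
```

Each group keeps its own weight_decay while sharing the common learning rate.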
Example
This example shows how to use AdamW to train a simple linear model on dummy data. It demonstrates creating the optimizer, computing loss, backpropagation, and updating weights.
```python
import torch
import torch.nn as nn

# Simple linear model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Create model and optimizer
model = SimpleModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.01)

# Dummy data
x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

# Loss function
criterion = nn.MSELoss()

# Training loop for 5 steps
for step in range(5):
    optimizer.zero_grad()         # Clear gradients
    outputs = model(x)            # Forward pass
    loss = criterion(outputs, y)  # Compute loss
    loss.backward()               # Backpropagation
    optimizer.step()              # Update weights
    print(f"Step {step+1}, Loss: {loss.item():.4f}")
```
Output
Step 1, Loss: 22.6977
Step 2, Loss: 1.0119
Step 3, Loss: 0.0547
Step 4, Loss: 0.0041
Step 5, Loss: 0.0004
Common Pitfalls
Common mistakes when using AdamW include:
- Not calling optimizer.zero_grad() before loss.backward(), causing gradient accumulation.
- Confusing weight_decay with the L2 regularization used by Adam (AdamW decouples weight decay from the gradient update).
- Passing incorrect parameters to the optimizer, such as forgetting model.parameters().
```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.01)

x = torch.tensor([[1.0]])
y = torch.tensor([[2.0]])
criterion = nn.MSELoss()

# Wrong: missing optimizer.zero_grad()
outputs = model(x)
loss = criterion(outputs, y)
loss.backward()
optimizer.step()

# Right way:
optimizer.zero_grad()
outputs = model(x)
loss = criterion(outputs, y)
loss.backward()
optimizer.step()
```
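To see why the missing zero_grad() matters, the sketch below calls loss.backward() twice without clearing gradients in between. Because the weights are unchanged between the two passes, the second backward adds an identical gradient on top of the first:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
x = torch.tensor([[1.0]])
y = torch.tensor([[2.0]])

# First backward pass: gradients start out fresh (None)
loss = criterion(model(x), y)
loss.backward()
first_grad = model.weight.grad.clone()

# Second backward pass WITHOUT zero_grad(): gradients accumulate
loss = criterion(model(x), y)
loss.backward()
accumulated = model.weight.grad.clone()

# The weights never changed, so the accumulated gradient is exactly double
print(torch.allclose(accumulated, 2 * first_grad))  # True
```

If optimizer.step() then ran on the accumulated gradient, the update would be based on a doubled gradient rather than the gradient of the last loss.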
Quick Reference
Tips for using AdamW optimizer in PyTorch:
- Use optimizer.zero_grad() before backpropagation to reset gradients.
- Set weight_decay to apply decoupled weight decay regularization.
- Adjust lr (learning rate) to control training speed and stability.
- AdamW is preferred over Adam when weight decay is needed because it decouples weight decay from the gradient updates.
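The decoupling can be made visible with a minimal sketch: give a single parameter a zero gradient and take one AdamW step. The adaptive (Adam) part of the update is zero, yet the parameter still shrinks by the factor (1 - lr * weight_decay), because AdamW applies decay directly to the weight rather than through the gradient:

```python
import torch

# A single parameter with value 1.0 and a zero gradient
p = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.AdamW([p], lr=0.1, weight_decay=0.01)

p.grad = torch.zeros_like(p)  # no gradient signal at all
optimizer.step()

# AdamW still shrinks the weight: p <- p * (1 - lr * weight_decay)
print(p.item())  # 0.999
```

With Adam's coupled L2 penalty, the decay term would instead pass through the adaptive rescaling, so the effective regularization would depend on the gradient history.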
Key Takeaways
- Create the AdamW optimizer with the model parameters, setting the learning rate and weight decay.
- Always call optimizer.zero_grad() before loss.backward() to clear old gradients.
- Use optimizer.step() to update model weights after backpropagation.
- AdamW applies weight decay separately from the gradient update, unlike Adam.
- Tune the learning rate and weight decay for best training results.