How to Use AdamW Optimizer in PyTorch: Syntax and Example
Create the optimizer with torch.optim.AdamW by passing your model parameters and a learning rate. After computing gradients, call optimizer.step() to update the model weights with AdamW's decoupled weight decay regularization.

Syntax
The AdamW optimizer in PyTorch is created by calling torch.optim.AdamW with the model parameters and optional settings like learning rate and weight decay.
- params: The model parameters to optimize, usually model.parameters().
- lr: Learning rate controlling the step size (default 0.001).
- weight_decay: Weight decay factor for regularization (default 0.01).
- betas: Coefficients for computing running averages of gradient and its square (default (0.9, 0.999)).
```python
optimizer = torch.optim.AdamW(params=model.parameters(), lr=0.001, weight_decay=0.01)
```
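AdamW also accepts parameter groups, so different parts of a model can use different settings. A common practice is to exclude biases (and normalization weights) from weight decay. Here is a minimal sketch of that pattern, assuming a small illustrative Sequential model:

```python
import torch
import torch.nn as nn

# Hypothetical model, just for illustration
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Split parameters: decay applies to weights, not to biases
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.001,
)
```

Each group keeps its own weight_decay while sharing the common learning rate.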
Example
This example shows how to use AdamW to train a simple linear model on dummy data. It demonstrates creating the optimizer, computing loss, backpropagation, and updating weights.
```python
import torch
import torch.nn as nn

# Simple linear model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Create model and optimizer
model = SimpleModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.01)

# Dummy data
x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

# Loss function
criterion = nn.MSELoss()

# Training loop for 5 steps
for step in range(5):
    optimizer.zero_grad()         # Clear gradients
    outputs = model(x)            # Forward pass
    loss = criterion(outputs, y)  # Compute loss
    loss.backward()               # Backpropagation
    optimizer.step()              # Update weights
    print(f"Step {step+1}, Loss: {loss.item():.4f}")
```
Output
Step 1, Loss: 22.6977
Step 2, Loss: 1.0119
Step 3, Loss: 0.0547
Step 4, Loss: 0.0041
Step 5, Loss: 0.0004
Common Pitfalls
Common mistakes when using AdamW include:
- Not calling optimizer.zero_grad() before loss.backward(), causing gradient accumulation.
- Confusing weight_decay with the L2 regularization used by Adam (AdamW decouples weight decay from the gradient update).
- Passing incorrect parameters to the optimizer, such as forgetting model.parameters().
```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.01)

x = torch.tensor([[1.0]])
y = torch.tensor([[2.0]])
criterion = nn.MSELoss()

# Wrong: missing optimizer.zero_grad()
outputs = model(x)
loss = criterion(outputs, y)
loss.backward()
optimizer.step()

# Right way:
optimizer.zero_grad()
outputs = model(x)
loss = criterion(outputs, y)
loss.backward()
optimizer.step()
```
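To see why the missing zero_grad() matters, the sketch below calls loss.backward() twice without clearing gradients in between. Because the weights are unchanged between the two passes, the second backward adds an identical gradient on top of the first:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
x = torch.tensor([[1.0]])
y = torch.tensor([[2.0]])

# First backward pass: gradients start out fresh (None)
loss = criterion(model(x), y)
loss.backward()
first_grad = model.weight.grad.clone()

# Second backward pass WITHOUT zero_grad(): gradients accumulate
loss = criterion(model(x), y)
loss.backward()
accumulated = model.weight.grad.clone()

# The weights never changed, so the accumulated gradient is exactly double
print(torch.allclose(accumulated, 2 * first_grad))  # True
```

If optimizer.step() then ran on the accumulated gradient, the update would be based on a doubled gradient rather than the gradient of the last loss.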
Quick Reference
Tips for using AdamW optimizer in PyTorch:
- Use optimizer.zero_grad() before backpropagation to reset gradients.
- Set weight_decay to apply decoupled weight decay regularization.
- Adjust lr (learning rate) to control training speed and stability.
- AdamW is preferred over Adam when weight decay is needed because it decouples weight decay from the gradient updates.
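The decoupling can be made visible with a minimal sketch: give a single parameter a zero gradient and take one AdamW step. The adaptive (Adam) part of the update is zero, yet the parameter still shrinks by the factor (1 - lr * weight_decay), because AdamW applies decay directly to the weight rather than through the gradient:

```python
import torch

# A single parameter with value 1.0 and a zero gradient
p = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.AdamW([p], lr=0.1, weight_decay=0.01)

p.grad = torch.zeros_like(p)  # no gradient signal at all
optimizer.step()

# AdamW still shrinks the weight: p <- p * (1 - lr * weight_decay)
print(p.item())  # 0.999
```

With Adam's coupled L2 penalty, the decay term would instead pass through the adaptive rescaling, so the effective regularization would depend on the gradient history.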
Key Takeaways
- Create the AdamW optimizer with the model parameters, setting the learning rate and weight decay.
- Always call optimizer.zero_grad() before loss.backward() to clear old gradients.
- Use optimizer.step() to update model weights after backpropagation.
- AdamW applies weight decay separately from the gradient update, unlike Adam.
- Tune the learning rate and weight decay for best training results.