
Learning rate differential in PyTorch

Introduction

Learning rate differential means assigning different learning rates to different parts of a model. This gives you finer control over training, because each part of the network can update at its own pace.

When fine-tuning a pre-trained model, so that new layers train faster than the pretrained ones.
When different parts of the model learn at different speeds and benefit from separate learning rates.
When attaching a small new module to a large pretrained model and you want to control each one's training speed.
When experimenting with training stability by slowing down updates to some layers.
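The fine-tuning case is the most common one. A minimal sketch of the idea (the `backbone` and `head` layers here are hypothetical stand-ins for a pretrained base and a freshly added classifier):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for fine-tuning: a "pretrained" backbone plus a new head
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

optimizer = torch.optim.SGD([
    {'params': backbone.parameters(), 'lr': 1e-4},  # small lr: preserve pretrained weights
    {'params': head.parameters(), 'lr': 1e-2},      # larger lr: let the new head adapt quickly
], momentum=0.9)

print([g['lr'] for g in optimizer.param_groups])  # [0.0001, 0.01]
```

Each dictionary becomes one entry in `optimizer.param_groups`, which is how you can inspect or modify the rates later.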
Syntax
PyTorch
optimizer = torch.optim.SGD([
    {'params': model.part1.parameters(), 'lr': 0.001},
    {'params': model.part2.parameters(), 'lr': 0.01}
], momentum=0.9)

You pass a list of dictionaries to the optimizer, each with its own learning rate.

Each dictionary must have a 'params' key with the parameters. The 'lr' key is optional: a group that omits it falls back to the optimizer's default learning rate.
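A quick check of the fallback behavior, with two throwaway layers: the per-group 'lr' overrides the optimizer-wide default, and a group without one inherits the default.

```python
import torch
import torch.nn as nn

layer_a = nn.Linear(4, 4)
layer_b = nn.Linear(4, 4)

# The second group omits 'lr', so it inherits the optimizer-wide default lr=0.01
optimizer = torch.optim.SGD([
    {'params': layer_a.parameters(), 'lr': 0.001},  # explicit per-group rate
    {'params': layer_b.parameters()},               # falls back to the default
], lr=0.01, momentum=0.9)

print([g['lr'] for g in optimizer.param_groups])  # [0.001, 0.01]
```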

Examples
Using Adam optimizer with a smaller learning rate for the base and a larger one for the head.
PyTorch
optimizer = torch.optim.Adam([
    {'params': model.base.parameters(), 'lr': 0.0001},
    {'params': model.head.parameters(), 'lr': 0.001}
])
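Param groups are not limited to the learning rate: any hyperparameter the optimizer accepts can be overridden per group. A sketch (the `base`/`head` layers are hypothetical) that disables weight decay on the new head only:

```python
import torch
import torch.nn as nn

base = nn.Linear(16, 8)
head = nn.Linear(8, 2)

# A group can override any optimizer hyperparameter, not just 'lr'
optimizer = torch.optim.Adam([
    {'params': base.parameters(), 'lr': 1e-4, 'weight_decay': 1e-2},
    {'params': head.parameters(), 'lr': 1e-3},  # weight_decay falls back to Adam's default (0)
])

for g in optimizer.param_groups:
    print(g['lr'], g['weight_decay'])
```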
Using SGD with momentum and different learning rates for two layers.
PyTorch
optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 0.01},
    {'params': model.layer2.parameters(), 'lr': 0.001}
], momentum=0.9)
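Learning rate schedulers work per group as well: a scheduler like `StepLR` scales every group's rate by the same factor, so the ratio between the groups is preserved. A small sketch with two throwaway layers:

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(8, 8)
layer2 = nn.Linear(8, 8)

optimizer = torch.optim.SGD([
    {'params': layer1.parameters(), 'lr': 0.01},
    {'params': layer2.parameters(), 'lr': 0.001},
], momentum=0.9)

# StepLR multiplies every group's lr by gamma, keeping the 10:1 ratio between groups
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

optimizer.step()   # step the optimizer before the scheduler (PyTorch >= 1.1 order)
scheduler.step()
print([round(g['lr'], 6) for g in optimizer.param_groups])  # [0.001, 0.0001]
```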
Sample Model

This code defines a simple model with two parts and gives each part its own learning rate in the optimizer. The training loop runs for 3 epochs and prints the loss after each one.

PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model with two parts
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(10, 5)
        self.part2 = nn.Linear(5, 2)

    def forward(self, x):
        x = torch.relu(self.part1(x))
        x = self.part2(x)
        return x

model = SimpleModel()

# Create dummy data
inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))

# Loss function
criterion = nn.CrossEntropyLoss()

# Optimizer with learning rate differential
optimizer = optim.SGD([
    {'params': model.part1.parameters(), 'lr': 0.001},
    {'params': model.part2.parameters(), 'lr': 0.01}
], momentum=0.9)

# Training loop for 3 epochs
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
Important Notes

Using different learning rates can help when some layers need slower or faster updates.

Make sure each group receives the parameters you intend: a module left out of every group will not be updated at all.
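One mistake PyTorch catches for you: placing the same parameters in more than one group raises a `ValueError`. A quick sketch with a throwaway layer:

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 3)

# The optimizer refuses parameter groups that share parameters
try:
    torch.optim.SGD([
        {'params': layer.parameters(), 'lr': 0.01},
        {'params': layer.parameters(), 'lr': 0.001},
    ], momentum=0.9)
except ValueError as err:
    print("Rejected:", err)
```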

Learning rate differential is common in transfer learning and fine-tuning.

Summary

Learning rate differential means setting different learning rates for parts of a model.

This helps control how fast each part learns during training.

It is useful for fine-tuning and improving training results.