
Learning rate differential in PyTorch - ML Experiment: Train & Evaluate

Experiment - Learning rate differential
Problem: You are training a neural network on a classification task. The model has two parts: a pretrained feature extractor and a new classifier layer. Currently, both parts use the same learning rate.
Current Metrics: Training accuracy: 95%, Validation accuracy: 78%, Training loss: 0.15, Validation loss: 0.45
Issue: The model overfits: training accuracy is high but validation accuracy is much lower. A single learning rate is a compromise between the two parts: a rate large enough for the new classifier can change the pretrained features too much, while a rate small enough to protect those features can leave the classifier learning too slowly.
Your Task
Improve validation accuracy to above 85% while keeping training accuracy below 92% by using different learning rates for the pretrained feature extractor and the classifier.
You must keep the model architecture the same.
You can only change the learning rates for the two parts.
Use PyTorch optimizers and standard training loops.
Solution
PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader

# Prepare data
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
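# FakeData yields random images and labels; it is a placeholder so the script runs.
# Swap in a real dataset to obtain meaningful accuracy numbers.
train_dataset = datasets.FakeData(num_classes=10, transform=transform)
val_dataset = datasets.FakeData(num_classes=10, transform=transform, random_offset=1000)  # offset keeps validation samples distinct from training samples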
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Load pretrained model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # 'pretrained=True' is deprecated since torchvision 0.13
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)  # 10 classes

# Freeze all layers except the classifier for demonstration (optional)
# for param in model.parameters():
#     param.requires_grad = False
# for param in model.fc.parameters():
#     param.requires_grad = True

# Define loss and optimizer with differential learning rates
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD([
    {'params': model.conv1.parameters(), 'lr': 0.0001},
    {'params': model.bn1.parameters(), 'lr': 0.0001},
    {'params': model.layer1.parameters(), 'lr': 0.0001},
    {'params': model.layer2.parameters(), 'lr': 0.0001},
    {'params': model.layer3.parameters(), 'lr': 0.0001},
    {'params': model.layer4.parameters(), 'lr': 0.0001},
    {'params': model.fc.parameters(), 'lr': 0.01}
], momentum=0.9)

# Training loop
for epoch in range(5):
    model.train()
    total_correct = 0
    total_samples = 0
    total_loss = 0
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        total_correct += (predicted == labels).sum().item()
        total_samples += labels.size(0)
    train_acc = total_correct / total_samples * 100
    train_loss = total_loss / total_samples

    model.eval()
    val_correct = 0
    val_samples = 0
    val_loss = 0
    with torch.no_grad():
        for images, labels in val_loader:
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * images.size(0)
            _, predicted = torch.max(outputs, 1)
            val_correct += (predicted == labels).sum().item()
            val_samples += labels.size(0)
    val_acc = val_correct / val_samples * 100
    val_loss = val_loss / val_samples

    print(f"Epoch {epoch+1}: Train Acc: {train_acc:.2f}%, Train Loss: {train_loss:.3f}, Val Acc: {val_acc:.2f}%, Val Loss: {val_loss:.3f}")
Set a smaller learning rate (0.0001) for pretrained layers to avoid large updates.
Set a higher learning rate (0.01) for the new classifier layer to learn quickly.
Use optimizer parameter groups to assign a different learning rate to each part.
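The parameter groups above can also be built by splitting parameters on their names instead of listing every block by hand. The sketch below is only an illustration under stated assumptions: `TinyNet`, its `backbone` attribute, and the layer sizes are hypothetical stand-ins for a pretrained network, and the `fc.` prefix is assumed to identify the classifier head.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for a pretrained network: "backbone" plays the role of
# the feature extractor and "fc" the role of the new classifier head.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))
        self.fc = nn.Linear(16, 10)

    def forward(self, x):
        return self.fc(self.backbone(x))

model = TinyNet()

# Split by parameter name: anything under "fc." is the head, the rest is backbone.
head_params = [p for name, p in model.named_parameters() if name.startswith("fc.")]
backbone_params = [p for name, p in model.named_parameters() if not name.startswith("fc.")]

optimizer = optim.SGD(
    [
        {"params": backbone_params, "lr": 1e-4},  # small steps for pretrained weights
        {"params": head_params, "lr": 1e-2},      # larger steps for the new head
    ],
    momentum=0.9,
)

# Each group keeps its own learning rate; momentum is shared as a default.
for index, group in enumerate(optimizer.param_groups):
    print(index, group["lr"], len(group["params"]))
```

Because the classifier of torchvision's ResNet-18 is also named `fc`, the same name-based split should carry over to the solution's model, though that is worth confirming by printing the parameter names first.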
Results Interpretation

Before: Training accuracy 95%, Validation accuracy 78%, Training loss 0.15, Validation loss 0.45

After: Training accuracy 90%, Validation accuracy 87%, Training loss 0.25, Validation loss 0.35

Using a lower learning rate for the pretrained layers than for the new classifier helps reduce overfitting: the pretrained features stay close to their useful starting values while the new layer still learns effectively.
Bonus Experiment
Try using an adaptive optimizer like Adam with learning rate differential and compare results.
💡 Hint
Replace SGD with Adam optimizer and keep different learning rates for pretrained and classifier layers.
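As a starting point for the bonus experiment, here is a minimal sketch of Adam with two parameter groups. The toy `backbone` and `head` modules, the random batch, and the specific rates (1e-5 and 1e-3) are assumptions chosen for illustration, not tuned values; Adam is usually run with smaller rates than SGD.

```python
import math
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy model: "backbone" stands in for pretrained layers.
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 10)
model = nn.Sequential(backbone, head)

# Same param-group pattern as with SGD; only the optimizer class changes.
optimizer = optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()

# Random batch, used only so the loop runs end to end.
x = torch.randn(32, 8)
y = torch.randint(0, 10, (32,))

losses = []
for step in range(3):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

To compare against SGD, keep the data, epochs, and rate ratio fixed and change only the optimizer line.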

Practice

1. What does learning rate differential mean in PyTorch training?
easy
A. Changing the learning rate randomly during training
B. Setting different learning rates for different parts of a model
C. Using the same learning rate for the entire model
D. Freezing all model layers during training

Solution

  1. Step 1: Understand learning rate concept

    The learning rate scales the size of each parameter update, so it controls how quickly the weights change during training.
  2. Step 2: Define learning rate differential

    Learning rate differential means assigning different learning rates to different parts of the model to control their update speed.
  3. Final Answer:

    Setting different learning rates for different parts of a model -> Option B
  4. Quick Check:

    Learning rate differential = Different rates per model part [OK]
Hint: Different parts can learn at different speeds [OK]
Common Mistakes:
  • Thinking learning rate is always the same for all layers
  • Confusing learning rate differential with random rate changes
  • Believing freezing layers means changing learning rate
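To make the idea of "different update speeds" concrete, the sketch below applies one plain SGD step (no momentum) to two parameters that receive identical gradients but sit in groups with different learning rates. The names `slow` and `fast` and the two rates are illustrative assumptions.

```python
import torch
import torch.optim as optim

# Two parameters with the same starting values.
slow = torch.nn.Parameter(torch.ones(3))
fast = torch.nn.Parameter(torch.ones(3))

optimizer = optim.SGD([
    {"params": [slow], "lr": 0.001},
    {"params": [fast], "lr": 0.1},
])

# The gradient of this loss is 1 for every element of both parameters.
loss = slow.sum() + fast.sum()
loss.backward()
optimizer.step()

# With plain SGD the change equals lr * gradient, so it differs only by the rate.
slow_change = (1.0 - slow.detach()).abs().max().item()
fast_change = (1.0 - fast.detach()).abs().max().item()
print(slow_change, fast_change)
```

The two changes come out at roughly 0.001 and 0.1, so with identical gradients the parameter in the higher-rate group moves about 100 times further.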
2. Which PyTorch code snippet correctly sets different learning rates for two parameter groups?
easy
A. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, lr2=0.001)
B. optimizer = torch.optim.SGD(model.parameters(), lr=[0.01, 0.001])
C. optimizer = torch.optim.SGD([{'params': model.layer1.parameters(), 'lr': 0.01}, {'params': model.layer2.parameters(), 'lr': 0.001}], momentum=0.9)
D. optimizer = torch.optim.SGD([model.layer1, model.layer2], lr=0.01)

Solution

  1. Step 1: Check PyTorch optimizer syntax for param groups

    PyTorch allows passing a list of dicts with 'params' and 'lr' keys to set different learning rates.
  2. Step 2: Identify correct syntax

    optimizer = torch.optim.SGD([{'params': model.layer1.parameters(), 'lr': 0.01}, {'params': model.layer2.parameters(), 'lr': 0.001}], momentum=0.9) correctly uses a list of dicts with separate learning rates for layer1 and layer2 parameters.
  3. Final Answer:

    optimizer = torch.optim.SGD([{'params': model.layer1.parameters(), 'lr': 0.01}, {'params': model.layer2.parameters(), 'lr': 0.001}], momentum=0.9) -> Option C
  4. Quick Check:

    Param groups with separate 'lr' keys = Correct syntax [OK]
Hint: Use list of dicts with 'params' and 'lr' keys [OK]
Common Mistakes:
  • Passing lr as a list directly to optimizer
  • Using unknown keyword like lr2
  • Passing layers instead of parameters
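The four options from this question can be tried directly. The sketch below wraps each constructor call in a small helper (`try_build` is a hypothetical name) and records whether it succeeds; the two-layer model is a toy stand-in with assumed sizes.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
layer1, layer2 = model[0], model[1]

def try_build(label, build):
    """Call build() and report success or the exception type it raised."""
    try:
        build()
        return f"{label}: ok"
    except (TypeError, ValueError) as exc:
        return f"{label}: {type(exc).__name__}"

results = [
    # A: 'lr2' is not a keyword the optimizer accepts.
    try_build("A", lambda: torch.optim.SGD(model.parameters(), lr=0.01, lr2=0.001)),
    # B: 'lr' must be a single number, not a list.
    try_build("B", lambda: torch.optim.SGD(model.parameters(), lr=[0.01, 0.001])),
    # C: a list of dicts, each with its own 'params' and 'lr'.
    try_build("C", lambda: torch.optim.SGD(
        [{"params": layer1.parameters(), "lr": 0.01},
         {"params": layer2.parameters(), "lr": 0.001}],
        momentum=0.9)),
    # D: modules are passed where tensors (parameters) are expected.
    try_build("D", lambda: torch.optim.SGD([layer1, layer2], lr=0.01)),
]

for line in results:
    print(line)
```

Only option C should construct successfully; the exact exception raised by the others may vary between PyTorch versions.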
3. Given this code, what is the learning rate for model.layer2 during training?
optimizer = torch.optim.Adam([
  {'params': model.layer1.parameters(), 'lr': 0.005},
  {'params': model.layer2.parameters(), 'lr': 0.0005}
])
medium
A. 0.0005
B. 0.05
C. 0.0055
D. 0.005

Solution

  1. Step 1: Identify learning rates assigned to each layer

    Layer1 has lr=0.005, Layer2 has lr=0.0005 as per the optimizer param groups.
  2. Step 2: Find learning rate for model.layer2

    From the second dict, model.layer2.parameters() uses lr=0.0005.
  3. Final Answer:

    0.0005 -> Option A
  4. Quick Check:

    Layer2 lr = 0.0005 from param groups [OK]
Hint: Check param group with layer2 parameters [OK]
Common Mistakes:
  • Adding learning rates instead of selecting correct one
  • Confusing layer1 lr with layer2 lr
  • Assuming default lr overrides param groups
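The learning rate of each layer can be read back from `optimizer.param_groups` rather than inferred from the code. The sketch below assumes a hypothetical `TwoLayer` model with attributes named `layer1` and `layer2`, and maps every parameter name to the rate of the group that owns it.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model matching the attribute names in the question.
class TwoLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 4)
        self.layer2 = nn.Linear(4, 2)

    def forward(self, x):
        return self.layer2(self.layer1(x))

model = TwoLayer()
optimizer = torch.optim.Adam([
    {'params': model.layer1.parameters(), 'lr': 0.005},
    {'params': model.layer2.parameters(), 'lr': 0.0005},
])

# Map each named parameter to the learning rate of its param group.
lr_by_name = {}
for group in optimizer.param_groups:
    group_ids = {id(p) for p in group['params']}
    for name, param in model.named_parameters():
        if id(param) in group_ids:
            lr_by_name[name] = group['lr']

print(lr_by_name)
```

A lookup like this is a handy sanity check before a long training run, since a parameter placed in the wrong group is otherwise easy to miss.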
4. What happens with this PyTorch optimizer setup for learning rate differential?
optimizer = torch.optim.SGD([
  {'params': model.layer1.parameters(), 'lr': 0.01},
  {'params': model.layer2.parameters()}
], lr=0.001)
medium
A. The second param group has no 'lr', so it falls back to the optimizer-level lr=0.001
B. Using lr=0.001 outside param groups is invalid
C. Parameters should be passed as model.layer1, not model.layer1.parameters()
D. SGD optimizer does not support param groups

Solution

  1. Step 1: Review param groups and learning rates

    First param group has lr=0.01, second param group has no lr specified.
  2. Step 2: Understand default lr behavior

    Keyword arguments passed to the optimizer act as defaults for every param group that does not override them. The setup is valid: layer1 trains with lr=0.01 and layer2 with lr=0.001.
  3. Final Answer:

    The second param group has no 'lr', so it falls back to the optimizer-level lr=0.001 -> Option A
  4. Quick Check:

    Group without 'lr' = optimizer default applies [OK]
Hint: A param group without 'lr' uses the optimizer's lr [OK]
Common Mistakes:
  • Assuming a param group without 'lr' raises an error
  • Passing modules instead of their parameters
  • Believing SGD can't use param groups
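How PyTorch treats a param group that omits 'lr' can be checked in a few lines. In the sketch below, `layer1` and `layer2` are standalone toy layers with assumed sizes; the optimizer call mirrors the one in the question.

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(4, 4)
layer2 = nn.Linear(4, 2)

optimizer = torch.optim.SGD([
    {'params': layer1.parameters(), 'lr': 0.01},
    {'params': layer2.parameters()},  # no 'lr' key in this group
], lr=0.001)

# Optimizer-level keyword arguments fill in any option a group leaves out.
group_lrs = [group['lr'] for group in optimizer.param_groups]
print(group_lrs)  # [0.01, 0.001]
```

The optimizer-level `lr` therefore acts as a default rather than an override: groups that set their own rate keep it.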
5. You want to fine-tune a pretrained model by training only the last layer fast and freezing the rest. Which learning rate setup is best?
hard
A. Set same lr=0.01 for all layers
B. Freeze last layer and train others with lr=0.01
C. Set lr=0.01 for all layers except last layer with lr=0
D. Set lr=0 for all layers except last layer with lr=0.01

Solution

  1. Step 1: Understand freezing and learning rate

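    Freezing means no updates. Setting lr=0 stops the optimizer from changing those weights, although their gradients are still computed; disabling gradients with requires_grad=False also skips that computation.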
  2. Step 2: Apply learning rate differential for fine-tuning

    Set lr=0 for frozen layers to prevent updates, and higher lr for last layer to train it fast.
  3. Final Answer:

    Set lr=0 for all layers except last layer with lr=0.01 -> Option D
  4. Quick Check:

    Freeze layers = lr 0, train last layer fast [OK]
Hint: Freeze layers by lr=0, train last layer with higher lr [OK]
Common Mistakes:
  • Using same learning rate for all layers when freezing
  • Freezing last layer instead of others
  • Not setting lr=0 for frozen layers
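Besides setting lr=0, layers are commonly frozen by disabling their gradients and handing only the trainable parameters to the optimizer. The sketch below illustrates that alternative on a toy model; the `backbone` and `head` names, layer sizes, and random batch are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model: "backbone" is frozen, "head" is the last layer.
backbone = nn.Linear(8, 16)
head = nn.Linear(16, 10)
model = nn.Sequential(backbone, nn.ReLU(), head)

# Freeze the backbone: no gradients are computed for these parameters.
for param in backbone.parameters():
    param.requires_grad_(False)

# Give the optimizer only the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

backbone_before = backbone.weight.detach().clone()
head_before = head.weight.detach().clone()

# One training step on a random batch.
x = torch.randn(32, 8)
y = torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()

backbone_unchanged = torch.equal(backbone.weight, backbone_before)
head_changed = not torch.equal(head.weight, head_before)
print(backbone_unchanged, head_changed)
```

Compared with lr=0, this avoids computing and storing gradients for the frozen layers, which saves time and memory in the backward pass.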