Learning rate differential means using different learning rates for different parts of a model. This helps the model learn better by adjusting how fast each part changes.
Learning rate differential in PyTorch
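optimizer = torch.optim.SGD([
    {'params': model.part1.parameters(), 'lr': 0.001},
    {'params': model.part2.parameters(), 'lr': 0.01}
], momentum=0.9)

You pass a list of dictionaries to the optimizer, one per parameter group.
Each dictionary must have a 'params' key with the parameters. The 'lr' key sets that group's learning rate; if it is omitted, the group falls back to the lr passed to the optimizer itself. Options given outside the list, such as momentum=0.9 here, apply to every group.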
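To check which settings each group actually received, you can read them back from `optimizer.param_groups`. The following is a minimal sketch; the `TwoPart` model and its layer sizes are made up for illustration and are not part of the original lesson.

```python
import torch
import torch.nn as nn

# Hypothetical two-part model, used only for illustration.
class TwoPart(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(10, 5)
        self.part2 = nn.Linear(5, 2)

    def forward(self, x):
        return self.part2(torch.relu(self.part1(x)))

model = TwoPart()

optimizer = torch.optim.SGD([
    {'params': model.part1.parameters(), 'lr': 0.001},
    {'params': model.part2.parameters(), 'lr': 0.01},
], momentum=0.9)

# Each entry in param_groups is a dict holding that group's settings.
group_lrs = [group['lr'] for group in optimizer.param_groups]
group_momenta = [group['momentum'] for group in optimizer.param_groups]

for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr={group['lr']}, momentum={group['momentum']}, "
          f"tensors={len(group['params'])}")
```

Each group keeps its own 'lr', while momentum is copied into both groups because it was passed outside the list.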
optimizer = torch.optim.Adam([
    {'params': model.base.parameters(), 'lr': 0.0001},
    {'params': model.head.parameters(), 'lr': 0.001}
])

optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 0.01},
    {'params': model.layer2.parameters(), 'lr': 0.001}
], momentum=0.9)

The code below defines a simple model with two parts and gives each part its own learning rate in the optimizer. The training loop runs 3 times and prints the loss each time.
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model with two parts
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(10, 5)
        self.part2 = nn.Linear(5, 2)

    def forward(self, x):
        x = torch.relu(self.part1(x))
        x = self.part2(x)
        return x

model = SimpleModel()

# Create dummy data
inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))

# Loss function
criterion = nn.CrossEntropyLoss()

# Optimizer with learning rate differential
optimizer = optim.SGD([
    {'params': model.part1.parameters(), 'lr': 0.001},
    {'params': model.part2.parameters(), 'lr': 0.01}
], momentum=0.9)

# Training loop for 3 epochs
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
Using different learning rates can help when some layers need slower or faster updates.
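Make sure each parameter is assigned to the intended group. A parameter may appear in only one group (PyTorch raises an error otherwise), and parameters left out of every group are never updated.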
Learning rate differential is common in transfer learning and fine-tuning.
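For fine-tuning, one way to build the groups is to split parameters by name and confirm that nothing was missed. This is a sketch under assumptions: `FineTuneNet`, the `base.` and `head.` prefixes, and the two learning rates are hypothetical choices for illustration, with `base` standing in for pretrained layers.

```python
import torch
import torch.nn as nn

# Hypothetical network: `base` stands in for pretrained layers,
# `head` for a newly added classifier.
class FineTuneNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 5))
        self.head = nn.Linear(5, 2)

    def forward(self, x):
        return self.head(self.base(x))

model = FineTuneNet()

# Split parameters by name so that each one lands in exactly one group.
base_params = [p for n, p in model.named_parameters() if n.startswith('base.')]
head_params = [p for n, p in model.named_parameters() if n.startswith('head.')]

# Sanity check: the two groups together cover every parameter.
n_total = len(list(model.parameters()))
assert len(base_params) + len(head_params) == n_total

optimizer = torch.optim.Adam([
    {'params': base_params, 'lr': 1e-4},  # small steps for pretrained layers
    {'params': head_params, 'lr': 1e-3},  # larger steps for the new head
])
```

The coverage check is cheap and catches the case where a newly added layer never reaches the optimizer.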
Learning rate differential means setting different learning rates for parts of a model.
This helps control how fast each part learns during training.
It is useful for fine-tuning and improving training results.
Practice
What does learning rate differential mean in PyTorch training?
Solution
Step 1: Understand learning rate concept
The learning rate controls how large a step the optimizer takes each time it updates the model's weights during training.
Step 2: Define learning rate differential
Learning rate differential means assigning different learning rates to different parts of the model to control their update speed.
Final Answer:
Setting different learning rates for different parts of a model -> Option B
Quick Check:
Learning rate differential = Different rates per model part [OK]
- Thinking learning rate is always the same for all layers
- Confusing learning rate differential with random rate changes
- Believing freezing layers means changing learning rate
Solution
Step 1: Check PyTorch optimizer syntax for param groups
PyTorch allows passing a list of dicts with 'params' and 'lr' keys to set different learning rates.
Step 2: Identify correct syntax
optimizer = torch.optim.SGD([{'params': model.layer1.parameters(), 'lr': 0.01}, {'params': model.layer2.parameters(), 'lr': 0.001}], momentum=0.9) correctly uses a list of dicts with separate learning rates for layer1 and layer2 parameters.
Final Answer:
optimizer = torch.optim.SGD([{'params': model.layer1.parameters(), 'lr': 0.01}, {'params': model.layer2.parameters(), 'lr': 0.001}], momentum=0.9) -> Option C
Quick Check:
Param groups with separate 'lr' keys = Correct syntax [OK]
- Passing lr as a list directly to optimizer
- Using unknown keyword like lr2
- Passing layers instead of parameters
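The last mistake in the list above can be reproduced directly. The sketch below uses a stand-in `layer1` and assumes the optimizer rejects a module under 'params' with a `TypeError`, which is the behavior expected from current PyTorch releases.

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(4, 3)  # stand-in layer for illustration

# Mistake: handing the module itself to 'params' instead of its parameters.
try:
    torch.optim.SGD([{'params': layer1, 'lr': 0.01}], momentum=0.9)
    module_accepted = True
except TypeError as err:
    module_accepted = False
    print(f"rejected: {err}")

# Correct: pass the iterable returned by .parameters().
optimizer = torch.optim.SGD(
    [{'params': layer1.parameters(), 'lr': 0.01}], momentum=0.9
)
```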
What learning rate is used for model.layer2 during training?

optimizer = torch.optim.Adam([
    {'params': model.layer1.parameters(), 'lr': 0.005},
    {'params': model.layer2.parameters(), 'lr': 0.0005}
])

Solution
Step 1: Identify learning rates assigned to each layer
Layer1 has lr=0.005, Layer2 has lr=0.0005 as per the optimizer param groups.
Step 2: Find learning rate for model.layer2
From the second dict, model.layer2.parameters() uses lr=0.0005.
Final Answer:
0.0005 -> Option A
Quick Check:
Layer2 lr = 0.0005 from param groups [OK]
- Adding learning rates instead of selecting correct one
- Confusing layer1 lr with layer2 lr
- Assuming default lr overrides param groups
optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 0.01},
    {'params': model.layer2.parameters()}
], lr=0.001)

Solution
Step 1: Review param groups and learning rates
First param group has lr=0.01, second param group has no lr specified.
Step 2: Understand default lr behavior
A param group that omits 'lr' falls back to the lr passed to the optimizer. Here the optimizer-level lr=0.001 applies to the second group, so layer1 trains with lr=0.01 and layer2 with lr=0.001, and no error is raised.
Final Answer:
The code is valid; the second param group uses the default lr=0.001
Quick Check:
Group without 'lr' = optimizer default applies [OK]
- Assuming the optimizer-level lr overrides an lr set inside a param group
- Passing parameters instead of parameter iterators
- Believing SGD can't use param groups
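A quick way to see how a group without an 'lr' key is handled is to build the optimizer and read the groups back. This is a minimal sketch with two stand-in layers (`layer1`, `layer2`) created only for illustration.

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(4, 4)
layer2 = nn.Linear(4, 2)

optimizer = torch.optim.SGD([
    {'params': layer1.parameters(), 'lr': 0.01},  # explicit per-group lr
    {'params': layer2.parameters()},              # no 'lr' key
], lr=0.001)

lrs = [group['lr'] for group in optimizer.param_groups]
print(lrs)  # [0.01, 0.001]
```

The explicit value wins for the first group, and the optimizer-level lr fills in for the second.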
Solution
Step 1: Understand freezing and learning rate
Freezing means no updates, which can be done by setting lr=0 or by disabling gradients (requires_grad=False). With lr=0, gradients for those layers are still computed; disabling gradients also skips that work.
Step 2: Apply learning rate differential for fine-tuning
Set lr=0 for frozen layers to prevent updates, and a higher lr for the last layer so it trains quickly.
Final Answer:
Set lr=0 for all layers except last layer with lr=0.01 -> Option D
Quick Check:
Freeze layers = lr 0, train last layer fast [OK]
- Using same learning rate for all layers when freezing
- Freezing last layer instead of others
- Not setting lr=0 for frozen layers
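The effect of lr=0 can be checked by comparing weights before and after one optimizer step. In this sketch the names `body` and `last`, the layer sizes, and the random data are assumptions made for illustration; the final loop shows the alternative of disabling gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model: `body` is to stay fixed, `last` is to be trained.
body = nn.Linear(6, 4)
last = nn.Linear(4, 2)

optimizer = torch.optim.SGD([
    {'params': body.parameters(), 'lr': 0.0},   # zero step size: no change
    {'params': last.parameters(), 'lr': 0.01},  # trained normally
], momentum=0.9)

body_before = body.weight.detach().clone()
last_before = last.weight.detach().clone()

# One training step on random data.
x = torch.randn(16, 6)
y = torch.randint(0, 2, (16,))
loss = nn.functional.cross_entropy(last(torch.relu(body(x))), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()

body_unchanged = torch.equal(body.weight, body_before)
last_changed = not torch.equal(last.weight, last_before)

# Alternative: disable gradients so autograd skips these layers entirely.
for p in body.parameters():
    p.requires_grad_(False)
```

Both approaches keep `body` fixed; disabling gradients is usually the cheaper one because the backward pass no longer computes gradients for those weights.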
