In training deep neural networks, why might we assign different learning rates to different layers?
Think about how early layers and later layers in a neural network behave differently during training.
Early layers capture general patterns and should change slowly to preserve learned features, so they use smaller learning rates. Later layers adapt to specific tasks and can learn faster with larger learning rates.
What will be the learning rate of the parameters in model.layer1 and model.layer2 after this code runs?
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 5)
        self.layer2 = nn.Linear(5, 2)

model = SimpleModel()
optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 0.001},
    {'params': model.layer2.parameters(), 'lr': 0.01}
], momentum=0.9)

lrs = [group['lr'] for group in optimizer.param_groups]
print(lrs)
Look at how the optimizer parameter groups are defined with different learning rates.
The optimizer has two parameter groups: one for layer1 with learning rate 0.001, and one for layer2 with learning rate 0.01. The param_groups list preserves the order in which the groups were passed in, so the code prints [0.001, 0.01].
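To see concretely what each parameter group contains, here is a minimal sketch (same model as above) that prints each group's learning rate alongside the shapes of the tensors it holds:

```python
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 5)
        self.layer2 = nn.Linear(5, 2)

model = SimpleModel()
optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 0.001},
    {'params': model.layer2.parameters(), 'lr': 0.01}
], momentum=0.9)

# Each group is a plain dict holding its own 'lr' next to its parameter tensors.
for i, group in enumerate(optimizer.param_groups):
    shapes = [tuple(p.shape) for p in group['params']]
    print(i, group['lr'], shapes)
# 0 0.001 [(5, 10), (5,)]
# 1 0.01 [(2, 5), (2,)]
```

The weight of layer1 has shape (5, 10) and its bias (5,), confirming that each group maps to exactly one layer's parameters.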
You want to fine-tune a pretrained model by freezing early layers and training only the last few layers. Which learning rate setup is best?
Frozen layers should not update during training.
Frozen layers should have zero learning rate to prevent updates. Trainable layers should have a small learning rate to fine-tune without large jumps.
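In PyTorch, the usual way to guarantee zero updates is to set requires_grad = False on the frozen parameters and pass only the trainable ones to the optimizer. A minimal sketch, using a small Sequential model as a stand-in for a pretrained network (the layer choice here is illustrative, not from the original):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained network: treat the first Linear as the "early" layers.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))

# Freeze the early layer: no gradients are computed, so no updates occur
# regardless of learning rate.
for p in model[0].parameters():
    p.requires_grad = False

# Hand the optimizer only the still-trainable parameters, with a small rate.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3)

print(len(trainable))  # 2: weight and bias of the final Linear
```

Compared with setting a zero learning rate, this approach also skips gradient computation for the frozen layers, saving memory and compute in the backward pass.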
During training with differential learning rates, you notice the loss decreases quickly at first but then plateaus. What is a likely cause?
Consider which layers learn task-specific features and how their learning rate affects training speed.
If later layers have too low learning rates, they learn slowly, causing loss to plateau early. Adjusting their learning rate can help continue loss reduction.
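Because param_groups are mutable dicts, the later layers' rate can be raised in place mid-training without rebuilding the optimizer. A hedged sketch (the model and rates are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
optimizer = torch.optim.Adam([
    {'params': model[0].parameters(), 'lr': 1e-4},
    {'params': model[2].parameters(), 'lr': 1e-4},  # too low for the task head
])

# If the loss plateaus, raise the later group's rate in place; the optimizer
# reads 'lr' from the group dict on every step.
optimizer.param_groups[1]['lr'] = 1e-2
print([g['lr'] for g in optimizer.param_groups])  # [0.0001, 0.01]
```

This in-place mutation is the same mechanism PyTorch's learning-rate schedulers use internally.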
Does this PyTorch code raise an error when trying to set different learning rates?
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 2)
)
optimizer = torch.optim.Adam([
    {'params': model[0].parameters(), 'lr': 0.001},
    {'params': model[1].parameters(), 'lr': 0.01}
])
Check which layers have parameters and which do not.
Activation layers like ReLU have no parameters, but parameters() is still defined on every nn.Module and simply returns an empty iterator, so no AttributeError is raised. The code runs, leaving the second parameter group empty. The real bug is silent: the final Linear layer (model[2]) is never passed to the optimizer, so its weights will not be updated during training.
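A small sketch demonstrating both points, and one way to avoid the silent bug by building groups only from modules that actually have parameters (the filtering approach here is one option, not the only one):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))

# parameters() exists on every nn.Module; for ReLU it simply yields nothing.
print(len(list(model[1].parameters())))  # 0

# Build one group per module, skipping modules with nothing to train,
# and make sure the final Linear actually gets a group.
groups = [
    {'params': list(m.parameters()), 'lr': lr}
    for m, lr in [(model[0], 0.001), (model[2], 0.01)]
    if any(True for _ in m.parameters())
]
optimizer = torch.optim.Adam(groups)
print(len(optimizer.param_groups))  # 2
```

Checking that every trainable layer is covered by some group is a cheap safeguard against optimizers that quietly skip part of the model.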