0
0
PyTorchml~20 mins

Why learning rate strategy affects convergence in PyTorch - Experiment to Prove It

Choose your learning style9 modes available
Experiment - Why learning rate strategy affects convergence
Problem:Train a simple neural network on the MNIST dataset to classify handwritten digits.
Current Metrics:Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.85
Issue:The model overfits: training accuracy is very high but validation accuracy is low, indicating poor generalization.
Your Task
Reduce overfitting by improving validation accuracy to above 85% while keeping training accuracy below 95%.
Keep the model architecture the same (a simple 2-layer fully connected network).
Only change the learning rate strategy (learning rate value and scheduler).
Use PyTorch for implementation.
Hint 1
Hint 2
Hint 3
Solution
PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define simple 2-layer neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)
    def forward(self, x):
        x = x.view(-1, 28*28)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Load MNIST dataset
transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
val_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=1000, shuffle=False)

# Initialize model, loss, optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)  # Lower initial learning rate

# Learning rate scheduler: StepLR reduces lr by 0.5 every 5 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

def train():
    model.train()
    total_loss = 0
    correct = 0
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * data.size(0)
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
    return total_loss / len(train_loader.dataset), correct / len(train_loader.dataset)

def validate():
    model.eval()
    total_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            loss = criterion(output, target)
            total_loss += loss.item() * data.size(0)
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
    return total_loss / len(val_loader.dataset), correct / len(val_loader.dataset)

# Training loop
num_epochs = 15
for epoch in range(1, num_epochs + 1):
    train_loss, train_acc = train()
    val_loss, val_acc = validate()
    scheduler.step()
    print(f'Epoch {epoch}: Train loss {train_loss:.4f}, Train acc {train_acc:.4f}, Val loss {val_loss:.4f}, Val acc {val_acc:.4f}')
Reduced initial learning rate from 0.1 to 0.05 to avoid large weight updates.
Added StepLR scheduler to reduce learning rate by half every 5 epochs to help convergence.
Kept model architecture and other hyperparameters unchanged.
Results Interpretation

Before: Training accuracy 98%, Validation accuracy 75%, Training loss 0.05, Validation loss 0.85

After: Training accuracy 93%, Validation accuracy 87%, Training loss 0.15, Validation loss 0.35

Using a smaller initial learning rate and reducing it gradually during training helps the model converge better. This reduces overfitting by preventing the model from fitting noise in training data and improves validation accuracy.
Bonus Experiment
Try using a cosine annealing learning rate scheduler instead of StepLR and observe the effect on convergence and accuracy.
💡 Hint
Cosine annealing gradually reduces the learning rate following a cosine curve, which can help the model escape local minima and improve generalization.