When fine-tuning a pre-trained neural network, why is it common to use a smaller learning rate compared to training from scratch?
Think about what happens if you change the pre-trained weights too much.
Fine-tuning starts from a model that already learned useful features. Using a smaller learning rate ensures updates are gentle, preserving these features while adapting to new data.
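The effect can be sketched with a single-weight toy example (the weight and gradient values below are made up for illustration, not taken from any real model):

```python
# One gradient-descent step on a hypothetical "pre-trained" weight,
# comparing a scratch-sized learning rate with a fine-tuning-sized one.
pretrained_w = 0.80          # hypothetical weight learned during pre-training
grad = 2.5                   # hypothetical gradient on the new task

for lr in (1e-2, 1e-4):      # scratch-like vs fine-tuning learning rate
    new_w = pretrained_w - lr * grad
    print(f"lr={lr}: w moves {pretrained_w:.5f} -> {new_w:.5f} "
          f"(change {abs(new_w - pretrained_w):.6f})")
```

The smaller rate moves the weight two orders of magnitude less, which is exactly the "gentle update" behavior the answer describes.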
Consider this TensorFlow code snippet that sets a learning rate schedule for fine-tuning:
import tensorflow as tf
initial_lr = 0.001
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=initial_lr,
    decay_steps=1000,
    decay_rate=0.96,
    staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
lrs = [lr_schedule(step).numpy() for step in [0, 1000, 2000, 3000]]
print(lrs)

What is the output printed?
Recall that with staircase=True, the learning rate decays in steps at multiples of decay_steps.
With staircase=True, the schedule multiplies the learning rate by decay_rate=0.96 once every decay_steps=1000 steps. At step 0: 0.001; step 1000: 0.001*0.96 = 0.00096; step 2000: 0.00096*0.96 = 0.0009216; step 3000: 0.0009216*0.96 = 0.000884736. (The printed float32 values are close approximations of these exact numbers.)
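These values can be checked without TensorFlow by applying the decay formula directly; the function below is a minimal sketch of what ExponentialDecay computes in staircase mode:

```python
# Staircase exponential decay:
#   lr(step) = initial_lr * decay_rate ** floor(step / decay_steps)
initial_lr, decay_rate, decay_steps = 0.001, 0.96, 1000

def staircase_lr(step):
    # Integer division reproduces the floor() that staircase=True applies.
    return initial_lr * decay_rate ** (step // decay_steps)

for step in (0, 1000, 2000, 3000):
    print(step, staircase_lr(step))
```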
You are fine-tuning a large pre-trained image classification model on a small dataset. Which learning rate choice is most appropriate to avoid overfitting and preserve learned features?
Think about the size of the dataset and the risk of losing pre-trained knowledge.
A low learning rate allows the model to adjust gently to new data without large changes that could erase useful pre-trained features, especially important with small datasets.
During fine-tuning, you observe the following behavior:
- Training loss decreases steadily.
- Validation loss starts increasing after a few epochs.
What does this suggest about the learning rate and model behavior?
Think about what causes validation loss to increase while training loss decreases.
When validation loss rises while training loss keeps falling, the model is overfitting the training data. During fine-tuning, a learning rate that is too high accelerates this: large updates let the model quickly overwrite general pre-trained features and memorize the small training set, so validation performance degrades after only a few epochs.
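The pattern can be illustrated with made-up loss curves and a simple check for the turning point. This is a sketch, not a Keras API; in practice a callback such as tf.keras.callbacks.EarlyStopping automates the same idea:

```python
# Hypothetical loss curves matching the observed behavior: training loss
# falls steadily while validation loss turns upward after a few epochs.
train_loss = [0.90, 0.70, 0.55, 0.44, 0.36, 0.30]
val_loss   = [0.95, 0.80, 0.72, 0.70, 0.74, 0.81]

def first_rise(losses):
    """Return the first epoch (0-indexed) where the loss increases, else None."""
    for epoch in range(1, len(losses)):
        if losses[epoch] > losses[epoch - 1]:
            return epoch
    return None

print("validation loss starts rising at epoch", first_rise(val_loss))
# Early stopping would keep the weights from the epoch with the lowest
# validation loss (epoch 3 in this made-up example).
```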
You fine-tune a pre-trained model with this optimizer setup:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
After a few batches, the training loss becomes NaN and the model stops learning. What is the most likely cause?
Consider what happens when the learning rate is set too high in gradient-based optimization.
A learning rate of 0.01 is high for Adam during fine-tuning. It produces very large weight updates, the gradients explode, and the loss overflows to NaN; once the loss is NaN, every subsequent update is NaN as well, so the model stops learning.
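The instability can be reproduced on a one-dimensional toy problem; the quadratic and its stability threshold below are illustrative, not taken from the question:

```python
# Gradient descent on f(w) = w**2, whose gradient is 2*w. Each step
# multiplies w by (1 - 2*lr), so any lr > 1.0 makes |w| grow without
# bound -- the same overshoot mechanism that drives a network's loss
# to inf/NaN when the learning rate is too high.
def run(lr, steps=60, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w        # gradient-descent update
    return w

print("stable lr 0.1:", run(0.1))   # magnitude shrinks toward 0
print("unstable lr 1.2:", run(1.2)) # magnitude grows explosively
```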