When fine-tuning a model, the key metrics depend on the task. For classification, accuracy, precision, recall, and F1 score indicate how well the model generalizes to new data. For regression, mean squared error or R-squared matter. Fine-tuning aims to improve performance on a specific task without losing general knowledge, so monitoring validation loss and validation metrics helps detect whether the model is improving or overfitting.
Fine-tuning strategy in PyTorch - Model Metrics & Evaluation
Which metrics matter for a fine-tuning strategy, and why
Confusion matrix example for fine-tuned classification model
|                 | Predicted Positive | Predicted Negative  |
|-----------------|--------------------|---------------------|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP)| True Negative (TN)  |
Example:
TP = 85, FP = 15, TN = 90, FN = 10
Total samples = 85 + 15 + 90 + 10 = 200
From this, we calculate:
- Precision = TP / (TP + FP) = 85 / (85 + 15) = 0.85
- Recall = TP / (TP + FN) = 85 / (85 + 10) = 0.8947
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.871
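The calculations above can be reproduced directly from the confusion-matrix counts; a minimal sketch in plain Python:

```python
# Metrics from the confusion-matrix counts above (TP=85, FP=15, TN=90, FN=10).
tp, fp, tn, fn = 85, 15, 90, 10

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

In practice a library such as scikit-learn or TorchMetrics would compute these from predictions and labels, but the arithmetic is the same.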
Precision vs Recall tradeoff in Fine-tuning
Fine-tuning can shift the balance between precision and recall. For example:
- If you fine-tune a spam detector, you want high precision to avoid marking good emails as spam.
- If you fine-tune a medical diagnosis model, you want high recall to catch as many true cases as possible, even if some false alarms occur.
Choosing which metric to prioritize depends on the real-world cost of mistakes.
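One common way to act on this tradeoff is to tune the decision threshold on the model's predicted probabilities. The sketch below uses hypothetical scores and labels to show how a stricter threshold raises precision while a looser one raises recall:

```python
# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def precision_recall(threshold):
    """Compute precision and recall when predicting positive at >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A high threshold favors precision (spam filter); a low one favors recall
# (medical screening).
print(precision_recall(0.7))   # stricter: fewer false positives
print(precision_recall(0.25))  # looser: fewer false negatives
```

With these made-up numbers, the strict threshold yields perfect precision but misses half the positives, while the loose threshold catches every positive at the cost of precision.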
Good vs Bad metric values for Fine-tuning
Good fine-tuning results show:
- Validation loss decreases steadily.
- Accuracy, precision, recall, and F1 improve compared to the base model.
- Balanced precision and recall if both matter.
Bad fine-tuning results show:
- Validation loss plateaus or increases (overfitting).
- Metrics on validation data do not improve or get worse.
- Very high precision but very low recall or vice versa without justification.
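A plateauing or rising validation loss is typically handled with early stopping. A minimal sketch of the stopping logic, using hypothetical per-epoch validation losses in place of a real training loop:

```python
# Early-stopping sketch: stop fine-tuning when validation loss has not
# improved for `patience` consecutive epochs. The loss values here are
# hypothetical stand-ins for losses measured after each epoch.
val_losses = [0.90, 0.72, 0.61, 0.58, 0.59, 0.60, 0.61]  # improves, then rises

patience = 2
best = float("inf")
bad_epochs = 0
stopped_at = None

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best = loss
        bad_epochs = 0          # improvement: reset the counter
    else:
        bad_epochs += 1         # no improvement this epoch
        if bad_epochs >= patience:
            stopped_at = epoch  # likely overfitting: stop here
            break

print(f"stopped at epoch {stopped_at}, best val loss {best:.2f}")
```

In a real PyTorch loop you would also save a checkpoint whenever `best` improves, so the deployed model is the one from the best epoch rather than the last.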
Common pitfalls in Fine-tuning metrics
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced.
- Data leakage: Using test data during fine-tuning inflates metrics falsely.
- Overfitting: Metrics improve on training but worsen on validation.
- Ignoring metric tradeoffs: Focusing only on accuracy without checking precision and recall.
- Not monitoring validation metrics: Only training metrics can hide poor generalization.
Self-check question
Your fine-tuned model has 98% accuracy but only 12% recall on the fraud class. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases, which is critical in fraud detection. High accuracy is misleading because fraud is rare, so the model mostly predicts non-fraud correctly but fails to catch fraud. You should improve recall even if accuracy drops a bit.
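Numbers like those in the self-check are easy to reproduce with hypothetical counts on an imbalanced dataset:

```python
# Accuracy paradox on imbalanced data (hypothetical counts): of 1000
# transactions, 25 are fraud. A model that catches only 3 of them can
# still score ~97.8% accuracy while recall on fraud is just 12%.
tp, fn = 3, 22        # fraud cases caught / missed
tn, fp = 975, 0       # non-fraud correctly passed / wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.3f} recall={recall:.2f}")
```

This is why per-class recall, not overall accuracy, is the metric to watch when the positive class is rare.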
Key Result
Fine-tuning success is best judged by balanced improvements in validation loss, precision, recall, and F1 score relevant to the task.