Hugging Face Fine-Tuning in Prompt Engineering / GenAI: Model Metrics & Evaluation

When fine-tuning models with Hugging Face, the key metrics depend on the task. For text classification, accuracy measures the fraction of texts labeled correctly. For question answering, F1 score captures partial-match quality, and for summarization, ROUGE measures overlap with reference summaries. These metrics tell us whether the model actually learned from the new data.
| Actual \ Predicted | Positive | Negative |
|--------------------|----------|----------|
| Positive           | 80       | 20       |
| Negative           | 10       | 90       |
Here, 80 true positives (TP), 90 true negatives (TN), 20 false negatives (FN), and 10 false positives (FP) are the counts used to calculate precision, recall, and accuracy.
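Reading the counts straight off the table (TP = 80, FN = 20, FP = 10, TN = 90), the core metrics follow from their standard formulas; this is a plain-Python illustration, not library code:

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted)
tp, fn = 80, 20   # actual positive: predicted positive / predicted negative
fp, tn = 10, 90   # actual negative: predicted positive / predicted negative

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                    # of predicted positives, how many are real
recall    = tp / (tp + fn)                    # of actual positives, how many were caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.850 precision=0.889 recall=0.800 f1=0.842
```

Note that precision and recall diverge (0.889 vs 0.800) even though accuracy looks solid, which is why all three are worth checking.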
Fine-tuning a spam detector: high precision means fewer legitimate emails marked as spam (less annoyance); high recall means catching most spam emails (better filtering). Depending on which matters more, you adjust the model or its decision threshold.
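One way to make that trade-off concrete is to sweep the decision threshold over the model's predicted spam probabilities. The scores and labels below are made-up toy data, not output from a real model:

```python
# Toy predicted spam probabilities with true labels (1 = spam), purely illustrative
scores = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def precision_recall(threshold):
    """Precision and recall when flagging everything at or above `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

# A high threshold favors precision (few good emails flagged);
# a low threshold favors recall (most spam caught).
for t in (0.85, 0.35):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
# threshold=0.85: precision=1.00 recall=0.50
# threshold=0.35: precision=0.67 recall=1.00
```

Moving the threshold trades one metric for the other on the same model, with no retraining.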
For medical text classification, high recall is critical to catch all disease mentions, even if some false alarms happen.
Good: Accuracy above 85%, F1 score above 0.8, and precision and recall reasonably close to each other (balanced).
Bad: Accuracy near random chance (e.g., 50% for two classes), very low recall (missing many positives), or very low precision (too many false alarms).
- Accuracy paradox: High accuracy but poor recall if data is imbalanced.
- Data leakage: Training data accidentally includes test examples, inflating metrics.
- Overfitting: Training metrics look great but test metrics drop, showing poor generalization.
- Ignoring task-specific metrics: Using accuracy for generation tasks where BLEU or ROUGE is better.
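To see why overlap metrics suit generation tasks better than exact-match accuracy, here is a hand-rolled ROUGE-1 (unigram overlap) calculation. In practice you would use an evaluation library such as Hugging Face's `evaluate`; this plain-Python sketch only shows what the score measures:

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """ROUGE-1: unigram overlap between a generated text and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())        # clipped count of shared words
    precision = overlap / sum(cand.values())    # how much of the candidate is on-topic
    recall = overlap / sum(ref.values())        # how much of the reference is covered
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f1

# Exact-match "accuracy" would score this pair 0, yet the texts are clearly close:
p, r, f1 = rouge1("the model was fine tuned on new data",
                  "the model was fine tuned using fresh data")
print(f"ROUGE-1 precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# ROUGE-1 precision=0.75 recall=0.75 f1=0.75
```

An overlap score of 0.75 reflects the partial match that a binary accuracy metric would throw away entirely.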
Your fine-tuned model shows 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases, which is dangerous. High accuracy is misleading because fraud is rare, so the model mostly predicts non-fraud correctly but fails to catch fraud.
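The accuracy paradox in that answer can be reproduced with concrete (hypothetical) counts: when only 2% of cases are fraud, a model that catches just 12% of them can still reach 98% accuracy:

```python
# Hypothetical confusion-matrix counts for a rare-fraud dataset (2% positive class)
tp, fn = 24, 176      # only 24 of 200 fraud cases are caught -> recall = 12%
fp, tn = 24, 9776     # the 9,800 legitimate cases are almost all predicted correctly

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")
# accuracy=98% recall=12%
```

Predicting "not fraud" for nearly everything inflates accuracy while the model misses 176 of the 200 actual fraud cases.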