When we fine-tune BERT for classification, the goal is to assign each text to its correct category. The key metrics to check are accuracy, precision, recall, and F1 score. Accuracy measures the overall fraction of texts labeled correctly. Precision measures how many of the predicted positive labels were actually correct. Recall measures how many of the true positive labels the model found out of all that exist. The F1 score balances precision and recall, which matters when classes are imbalanced or different mistakes carry different costs.
## BERT Fine-Tuning for Classification in NLP: Model Metrics & Evaluation
|                 | Predicted Positive      | Predicted Negative      |
|-----------------|-------------------------|-------------------------|
| Actual Positive | True Positive (TP): 80  | False Negative (FN): 20 |
| Actual Negative | False Positive (FP): 10 | True Negative (TN): 90  |
Total samples = TP + FP + TN + FN = 80 + 10 + 90 + 20 = 200
Accuracy = (TP + TN) / Total = (80 + 90) / 200 = 0.85
Precision = TP / (TP + FP) = 80 / (80 + 10) ≈ 0.89
Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84
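The arithmetic above can be checked with a few lines of plain Python, using the same confusion-matrix counts:

```python
# Metrics from the confusion matrix above (TP=80, FP=10, FN=20, TN=90).
tp, fp, fn, tn = 80, 10, 20, 90

total = tp + fp + fn + tn                    # 200 samples
accuracy = (tp + tn) / total                 # (80 + 90) / 200 = 0.85
precision = tp / (tp + fp)                   # 80 / 90  ≈ 0.89
recall = tp / (tp + fn)                      # 80 / 100 = 0.80
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.84

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In practice you would get these numbers from a library such as scikit-learn, but writing the formulas out once makes it clear exactly what each metric counts.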
Imagine BERT is classifying emails as spam or not spam.
- High Precision: Few good emails are wrongly marked as spam. This means users don't miss important emails. But some spam might get through.
- High Recall: Most spam emails are caught. But some good emails might be wrongly marked as spam, annoying users.
Depending on which error matters more, we adjust the model or the decision threshold. For spam filtering, high precision is usually preferred so that good emails are not lost.
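The precision/recall trade-off can be sketched by moving the decision threshold over a model's predicted probabilities. The scores and labels below are made-up toy values, not output from a real BERT model:

```python
# Hypothetical spam scores: each pair is (predicted probability of spam,
# true label), where 1 = spam and 0 = not spam.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.75, 1),
          (0.60, 1), (0.55, 0), (0.40, 1), (0.20, 0)]

def precision_recall(scored, threshold):
    """Precision and recall when 'spam' means score >= threshold."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Raising the threshold trades recall for precision.
for t in (0.5, 0.85):
    p, r = precision_recall(scored, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

With these toy numbers, a threshold of 0.5 gives precision ≈ 0.67 and recall 0.80, while raising it to 0.85 gives precision 1.00 but recall only 0.40: fewer good emails are flagged, but more spam slips through.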
Good: accuracy above 85%, precision and recall both above 80%, and an F1 score near 0.8 or higher. This means the model predicts well and finds most true labels without making many mistakes.
Bad: accuracy near 50% (like random guessing), precision or recall below 50%, or a very unbalanced pair (e.g., high precision but very low recall). This means the model is unreliable or misses many true cases.
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced. For example, if 90% of texts are class A, predicting all as A gives 90% accuracy but no real learning.
- Data leakage: If test data leaks into training, metrics look too good, but the model fails in real use.
- Overfitting: Very high training accuracy with low test accuracy means the model memorized the training data instead of learning general patterns.
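The accuracy paradox from the list above can be demonstrated in a few lines: a trivial baseline that always predicts the majority class scores 90% accuracy on a 90/10 split while never finding a single positive case.

```python
# Accuracy paradox sketch: 90% of labels are class 0, so a model that
# always predicts 0 scores 90% accuracy with zero recall.
labels = [0] * 90 + [1] * 10          # imbalanced: 90 negatives, 10 positives
preds  = [0] * 100                    # "predict the majority class" baseline

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
recall = tp / (tp + fn)               # 0 / 10 = 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

This is why, on imbalanced data, accuracy should always be read alongside per-class precision and recall.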
Your BERT model has 98% accuracy but only 12% recall on the positive class (e.g., fraud detection). Is this good for production? Why or why not?
Answer: No. The model misses 88% of actual positive cases, which is unacceptable in fraud detection. The high accuracy is misleading because the vast majority of samples are negative; recall on the positive class must improve before the model is fit for production.
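To see how 98% accuracy and 12% recall can coexist, here is a hypothetical confusion matrix (the counts are invented to match those two figures, not taken from a real system):

```python
# Hypothetical fraud-detection counts: 10,000 transactions, 100 of which
# are actually fraudulent. Chosen so accuracy = 98% but recall = 12%.
tp, fn = 12, 88        # only 12 of 100 fraud cases are caught
fp, tn = 112, 9788     # the 9,900 legitimate cases

total = tp + fp + fn + tn
accuracy = (tp + tn) / total    # (12 + 9788) / 10000 = 0.98
recall = tp / (tp + fn)         # 12 / 100 = 0.12
miss_rate = fn / (tp + fn)      # fraction of fraud that slips through

print(f"accuracy={accuracy:.2%} recall={recall:.2%} missed={miss_rate:.2%}")
```

Because fraud is only 1% of the data, the model can ignore it almost entirely and still look excellent on accuracy alone.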