# Text Classification in Python: Model Metrics & Evaluation

In text classification, we want to know how well a model sorts texts into the correct categories. The key metrics are Precision, Recall, and F1 score. Precision tells us how many of the texts labeled as a category really belong there. Recall tells us how many texts of a category the model actually found. The F1 score balances both. These matter because, depending on the task, we may want to avoid wrong labels (high precision) or catch all relevant texts (high recall).
|                     | Predicted Positive       | Predicted Negative       |
|---------------------|--------------------------|--------------------------|
| **Actual Positive** | True Positive (TP): 80   | False Negative (FN): 20  |
| **Actual Negative** | False Positive (FP): 10  | True Negative (TN): 90   |
Total samples = TP + FP + TN + FN = 80 + 10 + 90 + 20 = 200
Precision = TP / (TP + FP) = 80 / (80 + 10) ≈ 0.89
Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84
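The arithmetic above can be checked with a short Python sketch using plain functions (no external libraries; the counts are the ones from the confusion matrix):

```python
def precision(tp, fp):
    # Fraction of predicted positives that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives the model found
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

tp, fp, fn, tn = 80, 10, 20, 90
p = precision(tp, fp)   # 80 / 90  ≈ 0.89
r = recall(tp, fn)      # 80 / 100 = 0.80
score = f1(p, r)        # ≈ 0.84
```

Note that F1 is computed from the unrounded precision and recall, which is why the result can differ slightly from plugging in the rounded 0.89 and 0.80.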
Imagine a spam email filter:
- High Precision: Few good emails are wrongly marked as spam. Users don't miss important emails.
- High Recall: Most spam emails are caught, but some good emails might be wrongly marked as spam.
Depending on what matters more, we adjust the model. For spam, high precision is often preferred to avoid losing good emails.
In contrast, for a news topic classifier, high recall might be more important to find all relevant articles.
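One common way to "adjust the model" toward precision or recall is to move the decision threshold on the model's predicted probabilities. A minimal sketch, using made-up probabilities and labels (1 = spam) rather than a real model:

```python
# Hypothetical spam probabilities from a classifier, with true labels.
probs  = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0]

def scores_at(threshold):
    # Classify as spam when the probability reaches the threshold.
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A high threshold favors precision (few good emails flagged);
# a low threshold favors recall (most spam caught).
p_hi, r_hi = scores_at(0.9)
p_lo, r_lo = scores_at(0.2)
```

With these numbers, the 0.9 threshold gives perfect precision but low recall, while the 0.2 threshold catches all spam at the cost of precision.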
Good metrics: Precision and recall both above 0.8 mean the model labels texts correctly and finds most relevant texts.
Bad metrics: Precision or recall below 0.5 means many wrong labels or many missed texts. For example, a precision of 0.3 means 70% of the positive predictions are false positives.
Accuracy alone can be misleading when classes are unbalanced. For example, if 90% of the samples belong to one class, a model that always predicts that majority class reaches 90% accuracy without learning anything useful.
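This majority-class trap can be demonstrated with a few lines of Python, using hypothetical counts (180 negatives, 20 positives):

```python
# A "model" that always predicts the majority class (0 = negative).
labels = [0] * 180 + [1] * 20
preds  = [0] * 200

accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)

tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
recall = tp / (tp + fn)   # no positive example is ever found

# accuracy is 0.9, yet recall on the positive class is 0.0
```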
- Accuracy paradox: High accuracy but poor precision or recall due to unbalanced classes.
- Data leakage: When test data leaks into training, metrics look unrealistically high.
- Overfitting: Very high training metrics but low test metrics mean the model memorizes the training data rather than generalizing.
- Ignoring class imbalance: Not using metrics like F1 or weighted scores can hide poor performance on smaller classes.
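The last pitfall can be made concrete: per-class F1 (and its macro average) exposes weak performance on a small class that overall accuracy hides. A sketch with a hypothetical imbalanced set where class 1 is rare:

```python
# 95 common examples (class 0), 5 rare ones (class 1);
# the model catches only 1 of the 5 rare examples.
labels = [0] * 95 + [1] * 5
preds  = [0] * 95 + [1, 0, 0, 0, 0]

def f1_for(cls):
    # Precision, recall, and F1 computed for one class.
    tp = sum(1 for p, y in zip(preds, labels) if p == cls and y == cls)
    fp = sum(1 for p, y in zip(preds, labels) if p == cls and y != cls)
    fn = sum(1 for p, y in zip(preds, labels) if p != cls and y == cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
macro_f1 = (f1_for(0) + f1_for(1)) / 2

# accuracy looks strong (0.96), but macro F1 is dragged down
# by the rare class's F1 of only about 0.33
```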
Your text classification model has 98% accuracy but only 12% recall on a rare class. Is it good for production?
Answer: No. The model misses most examples of the rare class (low recall). Even with high accuracy, it fails to find important texts in that class. This is a problem if catching that class matters.
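To see how such numbers can coexist, here is a sketch with hypothetical counts close to the quiz scenario: 975 common examples, 25 rare ones, and a model that is right on every common example but finds only 3 rare ones:

```python
labels = [0] * 975 + [1] * 25
preds  = [0] * 975 + [1] * 3 + [0] * 22   # misses 22 of 25 rare examples

accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)

tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
rare_recall = tp / (tp + fn)

# accuracy ≈ 0.98, yet recall on the rare class is only 0.12
```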