Handling imbalanced text data in NLP - Model Metrics & Evaluation

When working with imbalanced text data, accuracy can be misleading: a model that always guesses the majority class still scores high. Precision, recall, and F1-score are more useful because they show how well the model finds the rare but important classes (like spam or fraud) without raising too many false alarms.
Actual \ Predicted | Positive | Negative
-------------------|----------|---------
Positive           |    40    |    10
Negative           |    20    |   930
Here, TP=40, FN=10, FP=20, TN=930. Total samples = 1000.
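From these four counts the headline metrics follow directly. A minimal Python sketch, using only the numbers from the table above:

```python
# Counts from the confusion matrix above.
tp, fn, fp, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fn + fp + tn)          # 970/1000 = 0.97
precision = tp / (tp + fp)                          # 40/60  ~= 0.667
recall = tp / (tp + fn)                             # 40/50   = 0.800
f1 = 2 * precision * recall / (precision + recall)  # ~= 0.727

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Notice the gap: 97% accuracy, but precision is only about 0.67 because 20 of the 60 positive predictions are false alarms.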
Imagine a spam filter. If it marks too many good emails as spam (low precision), people get annoyed. If it misses spam emails (low recall), spam floods inboxes. So, we balance precision and recall depending on what matters more.
For imbalanced text data, improving recall means catching more rare cases, but might lower precision (more false alarms). Improving precision means fewer false alarms but might miss some rare cases.
Good: precision and recall both above 0.7, giving an F1-score around 0.7 or higher. The model finds most of the rare cases while making few mistakes.
Bad: high accuracy (say 95%) but precision or recall below 0.2. The model is mostly guessing the majority class and missing the rare but important cases.
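The "bad" case is easy to reproduce: a model that always predicts the majority class. A quick sketch with 50 rare positives out of 1000 samples:

```python
# 1000 samples, 50 rare positives; the "model" always predicts negative.
labels = [1] * 50 + [0] * 950
preds = [0] * 1000

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)  # 0.0: not a single rare case is caught

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.95 recall=0.00
```

95% accuracy, zero recall: this is exactly the accuracy paradox listed below.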
- Accuracy paradox: High accuracy but poor detection of minority class.
- Data leakage: When test data leaks into training, metrics look better but model fails in real use.
- Overfitting: Model performs well on training but poorly on new data, metrics drop on validation.
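Of these pitfalls, data leakage is the easiest to introduce by accident in text pipelines: fitting preprocessing (such as a vocabulary or vectorizer) on the full dataset lets test-set words influence training. A sketch of the safe order, using made-up toy documents, split first, then fit on the training split only:

```python
# Toy corpus (invented for illustration): 1 = spam, 0 = ham.
docs = ["spam offer now", "meeting at noon", "win free prize", "lunch tomorrow"]
labels = [1, 0, 1, 0]

# Split FIRST...
train_docs, test_docs = docs[:3], docs[3:]

# ...THEN build the vocabulary from the training documents only.
vocab = sorted({w for d in train_docs for w in d.split()})

def encode(doc):
    """Bag-of-words vector over the train-only vocabulary."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

# Test documents are encoded with the train vocabulary; unseen words are
# simply dropped, exactly as they would be at inference time.
print(encode(test_docs[0]))
```

Building `vocab` from all of `docs` instead would quietly leak test-set vocabulary into training, inflating evaluation metrics.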
Your model has 98% accuracy but only 12% recall on the rare fraud class. Is it good for production?
Answer: No. The model misses 88% of fraud cases, which is dangerous. Despite high accuracy, low recall means it fails to catch most frauds. You should improve recall before using it.