RNN for text classification in NLP - Model Metrics & Evaluation
For text classification with RNNs, the key metrics are accuracy, precision, recall, and F1-score. Accuracy tells us what fraction of texts were labeled correctly overall, but when classes are imbalanced, precision and recall show how well the model finds each class, and the F1-score balances the two for a fairer picture.
| Actual \ Predicted | Positive | Negative |
|--------------------|----------|----------|
| Positive           | 80       | 20       |
| Negative           | 10       | 90       |
Total samples = 200
TP = 80, FP = 10, TN = 90, FN = 20
From this matrix, we calculate:
- Precision = 80 / (80 + 10) ≈ 0.89
- Recall = 80 / (80 + 20) = 0.80
- F1-score = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84
- Accuracy = (80 + 90) / 200 = 0.85
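The calculations above can be sketched in plain Python, using the counts from the worked example:

```python
# Confusion-matrix counts from the worked example above.
TP, FP, TN, FN = 80, 10, 90, 20

precision = TP / (TP + FP)                          # 80 / 90
recall    = TP / (TP + FN)                          # 80 / 100
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + FP + TN + FN)         # 170 / 200

print(f"Precision: {precision:.2f}")  # 0.89
print(f"Recall:    {recall:.2f}")     # 0.80
print(f"F1-score:  {f1:.2f}")         # 0.84
print(f"Accuracy:  {accuracy:.2f}")   # 0.85
```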
Imagine a spam filter using an RNN:
- High precision: Few good emails are wrongly marked as spam. Users don't miss important messages.
- High recall: Most spam emails are caught, though pushing recall up usually flags more good emails as spam (lowering precision).
Depending on what matters more, you tune the model to favor precision or recall.
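One common way to trade precision against recall is the decision threshold on the model's predicted spam probability. A minimal sketch with made-up scores and labels (illustrative numbers, not from a real RNN):

```python
# Hypothetical model scores (spam probability) and true labels (1 = spam).
# These ten values are invented for illustration.
scores = [0.95, 0.90, 0.75, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    1,    0,    1,    0,    0,    1,    0,    0]

def precision_recall(threshold):
    """Compute precision and recall when flagging scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A strict threshold favors precision; a loose one favors recall.
print(precision_recall(0.8))  # (1.00, 0.40): no false alarms, many misses
print(precision_recall(0.3))  # (0.57, 0.80): more spam caught, more false alarms
```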
Good: Accuracy above 85%, precision and recall both above 80%, and an F1-score in line with both. As rough rules of thumb, these mean the model classifies most texts correctly and copes with class imbalance.
Bad: High accuracy but very low recall (e.g., 30%) means the model misses many positive cases. Or high recall but very low precision means many false alarms.
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced.
- Data leakage: If test data leaks into training, metrics look unrealistically good.
- Overfitting: Very high training accuracy but low test accuracy means the model memorizes instead of learning.
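The accuracy paradox from the list above is easy to reproduce: on a set that is 95% negatives, a "model" that always predicts the majority class scores high accuracy while catching no positives at all.

```python
# Accuracy paradox on an imbalanced set: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
preds  = [0] * 100            # always predicts the majority (negative) class

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)

print(accuracy)  # 0.95 -- looks impressive
print(recall)    # 0.0  -- the model finds zero positive cases
```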
Question: Your RNN text classifier has 98% accuracy but only 12% recall on the positive class. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses most positive cases (low recall), which is critical if positive detection matters. High accuracy is misleading because the negative class dominates.
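The question's numbers are internally consistent. One assumed confusion matrix (10,000 samples, invented for illustration) that produces exactly 98% accuracy and 12% recall:

```python
# Assumed counts: 225 positives, 9,775 negatives (N = 10,000).
TP, FN = 27, 198      # only 27 of 225 positives are caught
FP, TN = 2, 9773      # the huge negative class is almost all correct

accuracy = (TP + TN) / (TP + FP + TN + FN)
recall   = TP / (TP + FN)

print(accuracy)  # 0.98 -- dominated by the easy negative class
print(recall)    # 0.12 -- most positives are missed
```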