Multi-class text classification in NLP - Model Metrics & Evaluation

In multi-class text classification, the goal is to assign each text to exactly one category out of many. The key metrics are accuracy, precision, recall, and F1-score, calculated per class and then averaged. Accuracy shows the overall fraction of correct predictions. Precision tells us how many of the texts predicted for a class truly belong to it. Recall shows how many texts of a class the model actually found. F1-score balances precision and recall. Together, these metrics show whether the model both finds and correctly labels each class.
Here is an example confusion matrix for 3 classes: Sports, Politics, and Tech.
                  Predicted
               Sports  Politics  Tech
True  Sports     50        2       3
      Politics    4       45       1
      Tech        2        3      48
Explanation:
- 50 texts truly Sports were predicted as Sports (True Positives for Sports)
- 2 Sports texts were wrongly predicted as Politics (False Negatives for Sports, False Positives for Politics)
- 3 Sports texts were wrongly predicted as Tech (False Negatives for Sports, False Positives for Tech)
- Similarly for Politics and Tech classes.
From this matrix, we calculate precision, recall, and F1 for each class.
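As a minimal sketch, these per-class calculations can be done directly from the matrix above in plain Python (rows are true classes, columns are predicted classes; the helper name `per_class_metrics` is our own, not from any library):

```python
def per_class_metrics(cm, classes):
    """Precision, recall, and F1 per class from a confusion matrix
    (rows = true class, columns = predicted class)."""
    metrics = {}
    n = len(classes)
    for i, name in enumerate(classes):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # row sum minus diagonal
        fp = sum(cm[r][i] for r in range(n)) - tp  # column sum minus diagonal
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        metrics[name] = (precision, recall, f1)
    return metrics

cm = [[50, 2, 3], [4, 45, 1], [2, 3, 48]]
result = per_class_metrics(cm, ["Sports", "Politics", "Tech"])
for name, (p, r, f1) in result.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

For the matrix above this gives, for example, Sports precision 50/56 ≈ 0.893 and Sports recall 50/55 ≈ 0.909.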
Imagine a model classifying news articles into categories. High precision for Sports means that when the model labels an article as Sports, it is usually right. Low recall for Sports means the model misses many Sports articles.
For example, if you want to recommend Sports articles only when the model is very sure, prioritize precision. If you want to catch all Sports articles even at the cost of some mistakes, prioritize recall.
Balancing precision and recall with F1-score helps when both false positives and false negatives matter.
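F1 is the harmonic mean of precision and recall, so one very low value drags the whole score down. A quick sketch with illustrative numbers:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Balanced precision and recall: F1 stays at the same level.
print(f"{f1(0.80, 0.80):.3f}")  # 0.800
# High precision but very low recall: F1 collapses toward the low value.
print(f"{f1(0.95, 0.10):.3f}")  # 0.181
```

This is why an F1 close to both precision and recall signals balance, while a large gap between the two shows up as a low F1.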
Good metrics:
- Accuracy above 80% on balanced data
- Precision and recall above 75% for each class
- F1-score close to precision and recall, showing balance
Bad metrics:
- Accuracy near random guess (e.g., 33% for 3 classes)
- Very low recall for some classes (missing many texts)
- High precision but very low recall or vice versa (unbalanced)
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced. For example, if one class makes up 90% of the data, always predicting it gives 90% accuracy but fails completely on the others.
- Ignoring per-class metrics: Overall accuracy hides poor results on minority classes.
- Data leakage: If test data leaks into training, metrics look unrealistically high.
- Overfitting: Very high training accuracy but low test accuracy means the model memorizes the training data instead of generalizing.
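The accuracy paradox can be made concrete with a toy example: a predictor that always outputs the majority class on 90/10 imbalanced data scores 90% accuracy, while macro-averaged recall, which weights every class equally, drops to 50% (a minimal sketch in plain Python; the data is synthetic):

```python
# 90 examples of class "A" all predicted correctly;
# 10 examples of class "B" all missed by a majority-class predictor.
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_for(cls):
    """Recall for a single class: correct predictions among its true examples."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == cls]
    return sum(t == p for t, p in relevant) / len(relevant)

macro_recall = (recall_for("A") + recall_for("B")) / 2
print(accuracy)      # 0.9
print(macro_recall)  # 0.5
```

Looking at macro-averaged (or per-class) metrics instead of plain accuracy is the standard guard against this pitfall.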
Your multi-class text classifier has 98% accuracy but recall for one important class is only 12%. Is this model good for production? Why or why not?
Answer: No, it is not good. The high accuracy likely comes from correctly predicting the majority classes. But the very low recall for the important class means the model misses most texts of that class. This can cause serious problems if that class is critical. You should improve recall for that class before using the model.
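A toy reconstruction of the scenario in the question, with hypothetical numbers chosen to roughly match 98% accuracy and 12% recall:

```python
# Hypothetical test set: 975 majority-class texts, all predicted correctly,
# and 25 texts of the important class, of which the model catches only 3.
y_true = ["other"] * 975 + ["critical"] * 25
y_pred = ["other"] * 975 + ["critical"] * 3 + ["other"] * 22

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
critical_recall = (
    sum(t == p == "critical" for t, p in zip(y_true, y_pred))
    / y_true.count("critical")
)
print(f"accuracy={accuracy:.3f}")         # accuracy=0.978
print(f"recall={critical_recall:.2f}")    # recall=0.12
```

The headline accuracy looks excellent, yet 22 of the 25 critical texts are missed, which is exactly why per-class recall must be checked before production.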