In sentiment analysis, we want to know how reliably the model identifies positive, negative, or neutral feelings in text. The key metrics are Accuracy, Precision, Recall, and F1-score. Accuracy tells us overall correctness, but because some sentiments may be rare, precision and recall show how well the model finds each sentiment class without too many false alarms or misses. F1-score balances precision and recall, giving a single number for comparing models.
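As a minimal sketch of how these metrics are computed, here is a toy example with made-up label lists (not real model output). Precision and recall are shown for the "neg" class only; the same computation applies to each class in turn:

```python
# Toy data: actual vs. predicted sentiment labels (made-up for illustration)
actual    = ["pos", "neg", "neu", "neg", "pos", "neg", "neu", "pos"]
predicted = ["pos", "neg", "neu", "pos", "pos", "neg", "neg", "pos"]

# Accuracy: fraction of all labels predicted correctly
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Precision/recall/F1 for the "neg" class
tp = sum(a == "neg" and p == "neg" for a, p in zip(actual, predicted))
fp = sum(a != "neg" and p == "neg" for a, p in zip(actual, predicted))
fn = sum(a == "neg" and p != "neg" for a, p in zip(actual, predicted))
precision = tp / (tp + fp)   # of texts flagged "neg", how many really were
recall = tp / (tp + fn)      # of truly "neg" texts, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Note that precision and recall look at different denominators: precision divides by what the model predicted, recall by what is actually true.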
Sentiment analysis pipeline in NLP - Model Metrics & Evaluation
Confusion matrix (rows = actual sentiment, columns = predicted sentiment; numbers = counts of predictions):

                 Predicted
              Pos   Neg   Neu
Actual Pos     50     5    10
Actual Neg      3    40     7
Actual Neu      8     6    60
This matrix shows how many texts were correctly or incorrectly labeled for each sentiment. For example, 50 positive texts were correctly predicted as positive, 5 were wrongly predicted as negative, and 10 as neutral.
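All the metrics above can be read straight off this matrix. The sketch below hard-codes the counts from the matrix and computes accuracy plus per-class precision, recall, and F1 (diagonal = correct predictions; a class's column sum is everything predicted as that class, its row sum is everything that actually belongs to it):

```python
# Confusion matrix from the text: rows = actual, columns = predicted
labels = ["Positive", "Negative", "Neutral"]
matrix = [
    [50, 5, 10],   # actual Positive
    [3, 40, 7],    # actual Negative
    [8, 6, 60],    # actual Neutral
]

total = sum(sum(row) for row in matrix)            # 189 texts in total
correct = sum(matrix[i][i] for i in range(3))      # 150 on the diagonal
accuracy = correct / total
print(f"Accuracy: {accuracy:.2%}")

per_class = {}
for i, label in enumerate(labels):
    tp = matrix[i][i]
    precision = tp / sum(matrix[r][i] for r in range(3))  # column sum
    recall = tp / sum(matrix[i])                          # row sum
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For these counts, accuracy is about 79%, and each class lands near 0.78 to 0.82 on precision and recall, a fairly balanced model.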
Imagine a company uses sentiment analysis to spot unhappy customers (negative sentiment) quickly. Here, recall is very important because every missed unhappy customer is a lost chance to help them. But if the model marks too many happy customers as unhappy (low precision), the support team wastes time chasing false alarms.
On the other hand, if the company only wants to be sure about unhappy customers before acting, precision matters more to avoid false alarms.
Balancing precision and recall depends on the goal: catching all negatives (high recall) or being very sure about negatives (high precision).
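This trade-off is often controlled by the decision threshold on the model's score. The sketch below uses hypothetical (score, actually-negative?) pairs, not real model output: raising the threshold makes the model more cautious (higher precision, lower recall), lowering it catches more negatives (higher recall, lower precision):

```python
# Hypothetical model scores for the "negative" class (made-up data)
# Each pair: (score for negative, is the text actually negative?)
examples = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
            (0.60, True), (0.55, False), (0.40, True), (0.20, False)]

def precision_recall(threshold):
    """Flag a text as negative when its score meets the threshold."""
    flagged = [is_neg for score, is_neg in examples if score >= threshold]
    tp = sum(flagged)                                   # true negatives caught
    fn = sum(is_neg for _, is_neg in examples) - tp     # negatives missed
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

At a low threshold the model catches every true negative but flags some happy customers too; at a high threshold every flag is correct but most unhappy customers slip through.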
Good: Accuracy above 85%, precision and recall above 80% for each sentiment class, and F1-score close to these values. This means the model correctly finds most sentiments and makes few mistakes.
Bad: Accuracy around 50-60%, precision or recall below 50%, or very unbalanced scores (e.g., high precision but very low recall). This means the model misses many sentiments or wrongly labels many texts.
- Accuracy paradox: If one sentiment is very common, a model guessing only that sentiment can have high accuracy but poor usefulness.
- Data leakage: If test data leaks into training, metrics look unrealistically high.
- Overfitting: Very high training accuracy but low test accuracy means the model memorizes training data but fails on new texts.
- Ignoring class imbalance: Not checking precision and recall per class can hide poor performance on rare sentiments.
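The accuracy paradox in the first bullet is easy to demonstrate. With made-up counts (95 positive reviews, 5 negative), a model that always guesses the majority class scores high accuracy while being useless on the class that matters:

```python
# Accuracy paradox sketch: heavily imbalanced data (assumed counts)
actual = ["pos"] * 95 + ["neg"] * 5
predicted = ["pos"] * 100  # a "model" that always guesses the majority class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall on the rare negative class
tp_neg = sum(a == "neg" and p == "neg" for a, p in zip(actual, predicted))
recall_neg = tp_neg / actual.count("neg")

print(f"accuracy={accuracy:.0%}, negative recall={recall_neg:.0%}")
```

Accuracy is 95%, yet recall on negatives is 0%: the model never finds a single unhappy customer. This is why per-class precision and recall must always be checked alongside accuracy.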
Your sentiment analysis model has 98% accuracy but only 12% recall on negative sentiment. Is it good for production? Why or why not?
Answer: No, it is not good for production. The model misses most negative sentiments (low recall), so unhappy customers go undetected. The high accuracy is misleading because the negative class is rare, so the model can be "mostly right" while failing at the task that matters. Recall on the negative class must be improved before deployment.
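To make the scenario concrete, here is one set of hypothetical counts (assumed, not given in the question) that reproduces 98% accuracy alongside 12% negative recall:

```python
# Hypothetical counts reproducing the scenario: 10,000 reviews, 200 negative
total, actual_neg = 10_000, 200
tp = 24                 # negatives correctly flagged
fn = actual_neg - tp    # 176 negatives missed
fp = 24                 # happy customers wrongly flagged as negative

accuracy = (total - fn - fp) / total
recall_neg = tp / actual_neg
print(f"accuracy={accuracy:.0%}, negative recall={recall_neg:.0%}")
```

With these numbers the model is wrong only 200 times out of 10,000, yet 176 of the 200 unhappy customers are never detected, exactly the failure mode the answer describes.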