
Fine-grained sentiment (5-class) in NLP - Model Metrics & Evaluation

Which metric matters for Fine-grained sentiment (5-class) and WHY

In fine-grained sentiment analysis with 5 classes (e.g., very negative, negative, neutral, positive, very positive), accuracy is a common metric because it shows how often the model predicts the exact sentiment correctly.

However, accuracy alone can hide problems if some classes are rare. So, we also use macro-averaged precision, recall, and F1-score. These treat each class equally, helping us see if the model struggles with any specific sentiment.

For example, if the model often misses "very negative" reviews, recall for that class will be low, signaling a problem.

Confusion matrix example for 5-class sentiment
       Predicted
       VN   N   Neu  P   VP
    VN  40   5    3   1    1
    N    4  50    6   3    2
    Neu  2   7   60   5    6
    P    1   3    7  55    4
    VP   0   1    4   6   60

    VN = Very Negative
    N = Negative
    Neu = Neutral
    P = Positive
    VP = Very Positive
    

This matrix shows how many samples from each true class (rows) were predicted as each class (columns).
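The per-class and macro-averaged metrics described above can be computed directly from this matrix. The sketch below uses plain Python and the exact counts shown (variable names are illustrative):

```python
# Confusion matrix from above: rows = true class, cols = predicted class
# class order: VN, N, Neu, P, VP
cm = [
    [40,  5,  3,  1,  1],
    [ 4, 50,  6,  3,  2],
    [ 2,  7, 60,  5,  6],
    [ 1,  3,  7, 55,  4],
    [ 0,  1,  4,  6, 60],
]

n = len(cm)
col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # predicted counts
row_sums = [sum(row) for row in cm]                             # true counts

precision = [cm[i][i] / col_sums[i] for i in range(n)]
recall    = [cm[i][i] / row_sums[i] for i in range(n)]
f1        = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

accuracy = sum(cm[i][i] for i in range(n)) / sum(row_sums)
macro_f1 = sum(f1) / n

print(f"accuracy={accuracy:.3f}  macro_F1={macro_f1:.3f}")
print(f"VN precision={precision[0]:.3f}  VN recall={recall[0]:.3f}")
```

For this particular matrix, accuracy works out to roughly 0.79 and macro F1 to roughly 0.79 as well, with "very negative" recall at 0.80 (40 of 50 true VN samples found).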

Precision vs Recall tradeoff with examples

Imagine the "very negative" class is important to catch because it signals urgent customer issues.

  • High precision for "very negative" means when the model says a review is very negative, it usually is. This avoids false alarms.
  • High recall means the model finds most very negative reviews, even at the cost of also flagging some reviews that are not actually very negative.

If your priority is fixing urgent problems quickly, favor high recall so that no very negative review slips through.

But if you want to avoid bothering your team with false alarms, high precision is better.
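One way to see this tradeoff concretely is to vary the score threshold used to flag a review as "very negative": a strict threshold favors precision, a lenient one favors recall. The scores and labels below are invented purely for illustration:

```python
# Hypothetical VN-class scores for 10 reviews (made-up numbers).
# label 1 = truly very negative, 0 = anything else.
data = [
    (0.95, 1), (0.90, 1), (0.80, 1), (0.70, 0), (0.65, 1),
    (0.55, 0), (0.50, 1), (0.40, 0), (0.30, 0), (0.10, 0),
]

def precision_recall(threshold):
    """Precision and recall for the VN class at a given score threshold."""
    flagged = [(s, y) for s, y in data if s >= threshold]
    tp = sum(y for _, y in flagged)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / sum(y for _, y in data)
    return precision, recall

for t in (0.75, 0.45):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f}  recall={r:.2f}")
```

With this toy data, the strict threshold 0.75 gives perfect precision but misses two of the five true VN reviews, while the lenient threshold 0.45 catches all five at the cost of some false alarms.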

What "good" vs "bad" metric values look like

Good model example:

  • Accuracy around 70% or higher (since 5 classes is harder than 2)
  • Macro F1-score above 0.65, showing balanced performance across classes
  • Precision and recall for each class above 0.6, especially for important classes like "very negative" and "very positive"

Bad model example:

  • Accuracy below 50% (for reference, random guessing over 5 balanced classes is 20%)
  • Macro F1-score below 0.4, meaning poor balance
  • Very low recall for some classes (e.g., 0.2 for "very negative"), meaning many missed cases
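These rules of thumb can be wired into a simple pre-deployment sanity check. The cutoffs below mirror the bullet points above; the function name and example inputs are hypothetical:

```python
# A minimal release-gate sketch applying the rule-of-thumb thresholds above.
def passes_quality_bar(accuracy, macro_f1, per_class_recall):
    checks = [
        accuracy >= 0.70,                        # well above the 20% random baseline
        macro_f1 >= 0.65,                        # balanced performance across classes
        min(per_class_recall.values()) >= 0.60,  # no class left behind
    ]
    return all(checks)

good = passes_quality_bar(0.74, 0.68,
                          {"VN": 0.80, "N": 0.77, "Neu": 0.75, "P": 0.79, "VP": 0.85})
bad  = passes_quality_bar(0.48, 0.38,
                          {"VN": 0.20, "N": 0.55, "Neu": 0.70, "P": 0.60, "VP": 0.50})
print(good, bad)
```

The first call matches the "good model" profile and passes; the second matches the "bad model" profile and fails on all three checks.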

Common pitfalls in metrics for fine-grained sentiment

  • Accuracy paradox: High accuracy can hide poor performance on rare classes if the dataset is imbalanced.
  • Ignoring class imbalance: Some sentiments may be rare but important; metrics must reflect this.
  • Data leakage: If test data leaks into training, metrics will be unrealistically high.
  • Overfitting: Very high training accuracy but low test accuracy means the model memorizes instead of generalizing.
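The accuracy paradox is easy to reproduce. On a synthetic set where 90% of reviews are neutral (counts invented for illustration), a degenerate model that always predicts "neutral" reaches 90% accuracy while having zero recall on "very negative":

```python
# Synthetic imbalanced labels: 90% neutral, only 2% very negative.
y_true = ["Neu"] * 900 + ["VN"] * 20 + ["N"] * 30 + ["P"] * 30 + ["VP"] * 20
y_pred = ["Neu"] * 1000  # degenerate model: always the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
vn_hits = sum(t == p == "VN" for t, p in zip(y_true, y_pred))
vn_recall = vn_hits / y_true.count("VN")

print(f"accuracy={accuracy:.2f}  VN recall={vn_recall:.2f}")  # 0.90 vs 0.00
```

Macro-averaged metrics expose this immediately: the VN class contributes an F1 of 0 to the macro average, dragging it far below the headline accuracy.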

Self-check question

Your fine-grained sentiment model has 98% accuracy but only 12% recall on the "very negative" class. Is it good for production? Why or why not?

Answer: No, it is not good. The very low recall on "very negative" means the model misses most very negative reviews. Even though overall accuracy is high, the model fails to catch important negative feedback, which could harm customer satisfaction.

Key Result
Use accuracy plus macro-averaged precision, recall, and F1 to fairly evaluate all 5 sentiment classes, especially rare but important ones.