
Sentiment analysis with scikit-learn in ML Python - Model Metrics & Evaluation

Which metrics matter for sentiment analysis with scikit-learn, and why

In sentiment analysis, we want to know how well the model can correctly identify positive and negative feelings in text. The key metrics are Precision, Recall, and F1-score.

Precision tells us what fraction of the texts the model labeled as positive (or negative) actually carry that sentiment. This is important if we want to avoid false alarms.

Recall tells us how many of the actual positive (or negative) texts the model found. This matters if missing a sentiment is costly.

F1-score balances precision and recall, giving a single number to understand overall performance.

Accuracy is also used but can be misleading if classes are imbalanced (e.g., many more positive than negative reviews).
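These metrics can be computed directly with scikit-learn. A minimal sketch, using a made-up toy dataset and a simple bag-of-words pipeline (the texts, labels, and model choice below are illustrative assumptions, not part of the lesson):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Tiny illustrative dataset: 1 = positive, 0 = negative
texts = ["I love this product", "Great quality, very happy",
         "Terrible, waste of money", "I hate it",
         "Absolutely wonderful", "Awful experience"]
labels = [1, 1, 0, 0, 1, 0]

# Bag-of-words features + logistic regression classifier
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
preds = model.predict(texts)

# classification_report prints precision, recall, and F1 per class
print(classification_report(labels, preds))
```

In a real project you would evaluate on a held-out test set rather than the training texts, but the report call is the same.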

Confusion matrix example

Suppose we have 100 reviews: 50 positive and 50 negative. The model predicts as follows:

      |                 | Predicted Positive      | Predicted Negative      |
      |-----------------|-------------------------|-------------------------|
      | Actual Positive | True Positive (TP) = 40 | False Negative (FN) = 10 |
      | Actual Negative | False Positive (FP) = 5 | True Negative (TN) = 45 |

Calculations:

  • Precision = TP / (TP + FP) = 40 / (40 + 5) = 0.89
  • Recall = TP / (TP + FN) = 40 / (40 + 10) = 0.80
  • F1-score = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84
  • Accuracy = (TP + TN) / Total = (40 + 45) / 100 = 0.85
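The worked example above can be reproduced with scikit-learn by building synthetic label arrays that match the confusion matrix (TP = 40, FN = 10, FP = 5, TN = 45):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Actual labels: 50 positive (1), 50 negative (0)
y_true = np.array([1] * 50 + [0] * 50)
# Predictions: 40 TP and 10 FN among positives; 5 FP and 45 TN among negatives
y_pred = np.array([1] * 40 + [0] * 10 + [1] * 5 + [0] * 45)

print(confusion_matrix(y_true, y_pred))           # rows = actual, cols = predicted
print(round(precision_score(y_true, y_pred), 2))  # 0.89
print(round(recall_score(y_true, y_pred), 2))     # 0.8
print(round(f1_score(y_true, y_pred), 2))         # 0.84
print(round(accuracy_score(y_true, y_pred), 2))   # 0.85
```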
Precision vs Recall tradeoff with examples

If we want to be very sure that positive reviews are truly positive, we focus on high precision. This means fewer false positives but might miss some positive reviews (lower recall).

If we want to catch as many positive reviews as possible, we focus on high recall. This means fewer false negatives but might include some false positives (lower precision).

For example, a company analyzing customer feedback might prefer high recall to not miss any unhappy customers. But if they want to send special offers only to truly happy customers, they might prefer high precision.
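One common way to move along this tradeoff is to change the decision threshold applied to predicted probabilities. A sketch with made-up probabilities (the `proba` values below are illustrative, not from a real model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# True labels and hypothetical predicted probabilities of "positive"
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba  = np.array([0.95, 0.80, 0.60, 0.40, 0.55, 0.30, 0.20, 0.10])

# A low threshold catches more positives (high recall, lower precision);
# a high threshold predicts positive only when confident (high precision,
# lower recall).
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Here the 0.3 threshold reaches recall 1.00 at precision 0.67, while the 0.7 threshold reaches precision 1.00 at recall 0.50.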

What good vs bad metric values look like

Good metrics:

  • Precision and recall above 0.80 show the model is reliable in identifying sentiments.
  • F1-score above 0.80 means a good balance.
  • Accuracy above 0.85 is good if classes are balanced.

Bad metrics:

  • Precision or recall below 0.50 means many errors in predictions.
  • F1-score below 0.60 shows poor overall performance.
  • High accuracy but low recall or precision indicates the model might be guessing the majority class.
Common pitfalls in metrics for sentiment analysis
  • Accuracy paradox: High accuracy can be misleading if one sentiment class dominates the data.
  • Data leakage: If test data leaks into training, metrics look better but model fails in real use.
  • Overfitting: Very high training accuracy but low test accuracy means the model memorizes training data, not generalizing well.
  • Ignoring class imbalance: Not using precision and recall can hide poor performance on minority classes.
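The accuracy paradox is easy to demonstrate: on an imbalanced dataset, a model that always predicts the majority class scores high accuracy while finding zero minority-class examples. The 95/5 split below is an illustrative assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced data: 95 positive reviews (1), 5 negative reviews (0)
y_true = np.array([1] * 95 + [0] * 5)
# A "model" that always predicts positive, i.e. the majority class
y_pred = np.ones(100, dtype=int)

print(accuracy_score(y_true, y_pred))             # 0.95 — looks great
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0 — finds no negatives
```

This is exactly why precision and recall per class, not accuracy alone, should drive the evaluation when classes are imbalanced.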
Self-check question

Your sentiment analysis model has 98% accuracy but only 12% recall on negative reviews. Is it good for production? Why or why not?

Answer: No, it is not good. The model misses 88% of negative reviews (low recall), which means many negative sentiments are not detected. High accuracy is likely due to many positive reviews dominating the data. This can cause poor user experience if negative feedback is ignored.

Key Result
Precision, recall, and F1-score are key to evaluate sentiment analysis models, especially when classes are imbalanced.