
Model selection for tasks in NLP - Model Metrics & Evaluation

Which metric matters for model selection and WHY

Choosing the right metric depends on the task you want your model to do. For example, if you want to classify emails as spam or not, precision matters because you want to avoid marking good emails as spam. If you want to find all spam emails, recall is important. For tasks like language translation or text generation, metrics like BLEU or ROUGE measure how close the output is to human language. Always pick a metric that matches what you care about in your task.

Confusion matrix example for classification tasks
      |                 | Predicted Positive  | Predicted Negative  |
      |-----------------|---------------------|---------------------|
      | Actual Positive | True Positive (TP)  | False Negative (FN) |
      | Actual Negative | False Positive (FP) | True Negative (TN)  |

    Example:
    TP = 70, FP = 10, TN = 900, FN = 20
    Total samples = 70 + 10 + 900 + 20 = 1000
    

From this, you can calculate precision = TP / (TP + FP) = 70 / (70 + 10) = 0.875 and recall = TP / (TP + FN) = 70 / (70 + 20) ≈ 0.778.
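The calculation above can be sketched in a few lines of plain Python, using the example counts from the confusion matrix:

```python
# Sketch: computing precision and recall from raw confusion-matrix counts.
# The counts are the example values from the table above.
def precision(tp, fp):
    """Fraction of positive predictions that were actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives the model found."""
    return tp / (tp + fn)

tp, fp, tn, fn = 70, 10, 900, 20
print(round(precision(tp, fp), 3))  # 0.875
print(round(recall(tp, fn), 3))     # 0.778
```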

Precision vs Recall tradeoff with examples

Imagine a model that detects cancer from scans. Missing a cancer case (low recall) is very bad because the patient won't get treatment. So, recall is more important here. But if the model marks many healthy people as cancer (low precision), it causes stress and extra tests.

In contrast, a spam filter should have high precision to avoid losing important emails, even if it misses some spam (lower recall).

Choosing the right balance depends on what mistakes cost more in your task.
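One common way to steer this balance is the decision threshold: most classifiers output a score, and where you cut it determines the tradeoff. The sketch below uses made-up scores and labels (not real model output) to show the effect:

```python
# Sketch: how moving the decision threshold trades precision against recall.
# Scores and labels are illustrative made-up values.
def precision_recall(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    prec = tp / (tp + fp) if (tp + fp) else 1.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,    0,   1,   1,   0,   0,   0]

# A high threshold favors precision (spam-filter style)...
print(precision_recall(scores, labels, 0.8))   # (1.0, 0.5)
# ...a low threshold favors recall (cancer-screening style).
print(precision_recall(scores, labels, 0.3))
```

Raising the threshold makes the model more cautious about predicting positive, which helps precision at the cost of recall; lowering it does the reverse.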

What good vs bad metric values look like

For a spam filter:

  • Good: Precision > 0.9 (few good emails marked as spam), recall around 0.7
  • Bad: Precision < 0.5 (many good emails marked as spam)

For a cancer detector:

  • Good: Recall > 0.9 (few missed cancers)
  • Bad: Recall < 0.5 (many cancers missed)

Accuracy alone can be misleading if classes are imbalanced.
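To see why, consider a toy example with made-up counts: 1000 emails where only 20 are spam, and a "model" that simply predicts "not spam" for everything:

```python
# Sketch: accuracy can look great on imbalanced data even when the model
# is useless. 1 = spam, 0 = not spam; the model never predicts spam.
labels      = [1] * 20 + [0] * 980
predictions = [0] * 1000

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)            # 980 / 1000 = 0.98

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
recall = tp / (tp + fn)                     # 0 / 20 = 0.0

print(accuracy, recall)  # 98% accuracy, zero recall
```

The model scores 98% accuracy while catching no spam at all, which is exactly the trap the next section's pitfalls describe.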

Common pitfalls in metrics
  • Accuracy paradox: High accuracy can happen if one class dominates, but the model ignores the minority class.
  • Data leakage: When test data leaks into training, metrics look too good but model fails in real life.
  • Overfitting: Model performs great on training but poorly on new data, metrics differ greatly between train and test.
  • Wrong metric choice: Using accuracy for imbalanced data or BLEU for tasks it doesn't fit.

Self-check question

Your model has 98% accuracy but only 12% recall on fraud cases. Is it good for production? Why or why not?

Answer: No, it is not good. The model misses 88% of fraud cases (low recall), which is dangerous because fraud goes undetected. High accuracy is misleading here because fraud is rare, so the model mostly predicts non-fraud correctly but fails at the important task.
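One set of made-up counts consistent with these numbers (98% accuracy, 12% recall on a rare fraud class) makes the problem concrete:

```python
# Sketch: hypothetical counts matching the self-check scenario.
tp, fn = 24, 176     # 200 fraud cases total; the model catches only 24
fp, tn = 24, 9776    # 9800 legitimate transactions

total = tp + fn + fp + tn
accuracy = (tp + tn) / total   # (24 + 9776) / 10000 = 0.98
recall = tp / (tp + fn)        # 24 / 200 = 0.12
missed = fn / (tp + fn)        # 176 / 200 = 0.88

print(accuracy)  # 0.98
print(recall)    # 0.12
print(missed)    # the share of fraud that slips through
```

Because fraud is only 2% of the data here, the model can reach 98% accuracy while letting 88% of fraud through.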

Key Result
Choosing the right metric depends on the task; precision and recall tradeoffs are key for model selection.