
Sequence classification in PyTorch - Model Metrics & Evaluation

Which metric matters for Sequence Classification and WHY

For sequence classification, common metrics include accuracy, precision, recall, and F1 score. Accuracy tells us how many sequences were correctly labeled overall. Precision shows how many predicted positive sequences were actually positive. Recall tells us how many actual positive sequences were found by the model. F1 score balances precision and recall, which is important when classes are uneven or mistakes have different costs.

Choosing the right metric depends on the task. For example, if missing a positive sequence is costly (like detecting spam or disease), recall is more important. If false alarms are costly, precision matters more.

Confusion Matrix for Sequence Classification
      |                 | Predicted Positive  | Predicted Negative  |
      |-----------------|---------------------|---------------------|
      | Actual Positive | True Positive (TP)  | False Negative (FN) |
      | Actual Negative | False Positive (FP) | True Negative (TN)  |

      Example:
      TP = 40, FP = 10, TN = 45, FN = 5

      Total samples = TP + FP + TN + FN = 40 + 10 + 45 + 5 = 100
    

From this matrix, we calculate:

  • Precision = TP / (TP + FP) = 40 / (40 + 10) = 0.8
  • Recall = TP / (TP + FN) = 40 / (40 + 5) ≈ 0.89
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.84
  • Accuracy = (TP + TN) / Total = (40 + 45) / 100 = 0.85
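These calculations can be verified with a few lines of plain Python, using the example counts from the confusion matrix above:

```python
# Confusion-matrix counts from the example above.
tp, fp, tn, fn = 40, 10, 45, 5

precision = tp / (tp + fp)                          # 40 / 50 = 0.80
recall = tp / (tp + fn)                             # 40 / 45 ≈ 0.89
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.84
accuracy = (tp + tn) / (tp + fp + tn + fn)          # 85 / 100 = 0.85

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.80 recall=0.89 f1=0.84 accuracy=0.85
```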
Precision vs Recall Tradeoff with Examples

Imagine a model that classifies sequences as "spam" or "not spam" emails.

  • High Precision, Low Recall: The model flags only very sure spam emails. Few false alarms, but it misses many spam emails. Good if you hate false spam labels.
  • High Recall, Low Precision: The model flags almost all spam emails but also many good emails by mistake. Good if you want to catch all spam but can tolerate some false alarms.

Choosing depends on what is worse: missing spam or wrongly marking good emails as spam.
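The trade-off can be made concrete by moving the decision threshold on a toy set of spam scores (the scores and labels below are made up for illustration, not from a real model):

```python
# Toy spam scores (higher = more spam-like) and true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def precision_recall(threshold):
    """Flag everything scoring at or above the threshold as spam."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return round(precision, 2), round(recall, 2)

# Strict threshold: only very sure flags -> high precision, low recall.
print(precision_recall(0.85))  # (1.0, 0.5)
# Lenient threshold: flag almost everything -> lower precision, high recall.
print(precision_recall(0.35))  # (0.67, 1.0)
```

Raising the threshold trades recall for precision; lowering it does the opposite. No single threshold maximizes both.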

What "Good" vs "Bad" Metric Values Look Like

For sequence classification:

  • Good: Accuracy above 85%, Precision and Recall above 80%, balanced F1 score above 0.8.
  • Bad: Accuracy near random chance (e.g., 50% for two classes), very low precision or recall (below 50%), or very unbalanced metrics (e.g., 90% precision but 10% recall).

Balanced metrics mean the model is reliable both in finding positives and avoiding false alarms.

Common Pitfalls in Metrics for Sequence Classification
  • Accuracy Paradox: High accuracy can be misleading if classes are imbalanced. For example, if 95% of sequences are negative, a model that always predicts negative gets 95% accuracy but is useless.
  • Data Leakage: If training data leaks into test data, metrics look unrealistically high.
  • Overfitting Indicators: Very high training accuracy but low test accuracy means the model memorizes training sequences but fails on new ones.
  • Ignoring Class Imbalance: Not using precision, recall, or F1 when classes are uneven can hide poor performance on minority classes.
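The accuracy paradox from the first bullet is easy to reproduce (a sketch with made-up counts: 95% negative sequences and a "model" that always predicts negative):

```python
# 95 negative and 5 positive sequences; the model always predicts negative.
labels = [0] * 95 + [1] * 5
preds = [0] * 100

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.95 0.0 -- high accuracy, zero recall on positives
```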
Self-Check Question

Your sequence classification model has 98% accuracy but only 12% recall on the positive class (e.g., detecting spam). Is this model good for production? Why or why not?

Answer: No, it is not good. The high accuracy is likely because most sequences are negative, so the model predicts negative most of the time. The very low recall means it misses almost all positive sequences, which defeats the purpose of detecting them.
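One confusion matrix consistent with those numbers (hypothetical counts chosen to match 98% accuracy and 12% recall) makes the failure concrete:

```python
# Hypothetical counts: 10,000 sequences, only 200 (2%) are spam.
tp, fn = 24, 176   # finds just 24 of 200 spam emails -> recall = 0.12
tn, fp = 9776, 24  # almost all non-spam correctly ignored

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.98 0.12 -- accuracy hides that 88% of spam slips through
```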

Key Result
For sequence classification, balanced precision, recall, and F1 score matter most to ensure the model finds positives without many false alarms.