
Text Classification in NLP - Why Metrics Matter

Which metric matters for this concept and WHY

For text classification, accuracy measures the share of predictions that are correct overall. But because some categories may be rare, precision and recall are often more informative.

Precision tells us how many documents labeled as a category truly belong there. This avoids false alarms.

Recall tells us how many documents of a category were found by the model. This avoids missing important documents.

F1 score balances precision and recall, giving a single number to compare models.
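Concretely, the F1 score is the harmonic mean of precision and recall, so a weak value in either one drags the score down. A minimal sketch in Python:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High precision cannot hide low recall: the harmonic mean
# stays close to the weaker of the two values.
print(round(f1_score(0.9, 0.3), 3))  # 0.45
```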

Confusion matrix example
      Actual \ Predicted | Sports | Politics | Tech | Total
      ---------------------------------------------------
      Sports            |  50    |   5      |  5   | 60
      Politics          |  3     |  45      |  2   | 50
      Tech              |  4     |   3      |  33  | 40
      ---------------------------------------------------
      Total             |  57    |  53      |  40  | 150
    

From this, we calculate metrics per category. For example, for Sports:

  • Precision = TP / (TP + FP) = 50 / (50 + 3 + 4) = 50 / 57 ≈ 0.877
  • Recall = TP / (TP + FN) = 50 / (50 + 5 + 5) = 50 / 60 ≈ 0.833
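The per-category calculation above can be reproduced directly from the confusion matrix. A minimal sketch using the counts from the table (rows are actual categories, columns are predicted):

```python
labels = ["Sports", "Politics", "Tech"]
matrix = [
    [50, 5, 5],   # actual Sports
    [3, 45, 2],   # actual Politics
    [4, 3, 33],   # actual Tech
]

def precision_recall(matrix, i):
    """Precision and recall for category index i."""
    tp = matrix[i][i]
    # False positives: predicted category i, but actually another category.
    fp = sum(matrix[r][i] for r in range(len(matrix)) if r != i)
    # False negatives: actually category i, but predicted another category.
    fn = sum(matrix[i][c] for c in range(len(matrix)) if c != i)
    return tp / (tp + fp), tp / (tp + fn)

for i, label in enumerate(labels):
    p, r = precision_recall(matrix, i)
    print(f"{label}: precision={p:.3f} recall={r:.3f}")
# Sports: precision=0.877 recall=0.833
```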
Precision vs Recall tradeoff with examples

If you want to avoid wrongly labeling documents (false positives), focus on high precision. For example, in legal document sorting, wrongly labeling a contract as a lawsuit is bad.

If you want to find all documents of a category (avoid false negatives), focus on high recall. For example, when flagging urgent support tickets, missing an urgent ticket is worse than flagging a few routine ones.

Balancing both with F1 score helps when both errors matter.

What "good" vs "bad" metric values look like

Good: Precision and recall above 0.85 generally mean the model finds and labels most documents correctly, with few mistakes.

Bad: Precision or recall below 0.5 means many documents are mislabeled or missed, making the model unreliable.

Accuracy alone can be misleading if categories are unbalanced.

Common pitfalls in metrics
  • Accuracy paradox: High accuracy but poor recall on rare categories.
  • Data leakage: When test data leaks into training, metrics look better but model fails in real use.
  • Overfitting: Very high training metrics but low test metrics means model memorizes instead of learning.
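The accuracy paradox from the list above can be demonstrated with a tiny, made-up imbalanced dataset: a model that always predicts the majority class scores high accuracy yet has zero recall on the rare category.

```python
# 98 "normal" documents and 2 "urgent" ones; the model always predicts "normal".
actual = ["normal"] * 98 + ["urgent"] * 2
predicted = ["normal"] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
urgent_tp = sum(a == p == "urgent" for a, p in zip(actual, predicted))
urgent_recall = urgent_tp / actual.count("urgent")

print(f"accuracy={accuracy:.2f}, urgent recall={urgent_recall:.2f}")
# accuracy=0.98, urgent recall=0.00
```

Accuracy looks excellent here precisely because the rare category barely affects it, which is why per-category recall must be checked separately.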
Self-check question

Your text classification model has 98% accuracy but only 12% recall on the "urgent" category. Is it good for production?

Answer: No. The model misses 88% of urgent documents, which is risky. High accuracy is misleading because "urgent" documents are rare but important. You should improve recall before using it.

Key Result
Precision and recall are key to evaluate text classification because they show how well the model finds and labels each document category.