
Tokenization in spaCy - Model Metrics & Evaluation

Which metric matters for Tokenization in spaCy and WHY

Tokenization splits text into words or pieces. The key metric is Tokenization Accuracy. It measures how many tokens the model splits correctly compared to a trusted standard. High accuracy means the text is split just right, which helps later steps like understanding meaning or finding keywords.
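A minimal sketch of what this step looks like in spaCy, using a blank English pipeline (the tokenizer works without any trained components; the example sentence is our own):

```python
import spacy

# spacy.blank("en") gives an empty English pipeline whose tokenizer
# still applies English rules and exceptions (contractions, punctuation).
nlp = spacy.blank("en")

doc = nlp("Don't panic!")
print([t.text for t in doc])  # ['Do', "n't", 'panic', '!']
```

Note how the contraction is split into two tokens and the punctuation into its own token; these are exactly the boundary decisions the metrics below score.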

Confusion matrix for Tokenization
|                 | Predicted token     | Predicted no token  |
|-----------------|---------------------|---------------------|
| Actual token    | True Positive (TP)  | False Negative (FN) |
| Actual no token | False Positive (FP) | True Negative (TN)  |

TP: token boundaries the tokenizer got right
FP: boundaries the tokenizer added where none exist
FN: real boundaries the tokenizer missed
TN: positions correctly left unsplit

Example: If the tokenizer correctly splits "don't" into "do" and "n't", each of those tokens counts as a TP. If it leaves "don't" unsplit, the two gold tokens count as FNs (and the unsplit "don't" as an FP).

Tradeoff: Precision vs Recall in Tokenization

Precision is the fraction of predicted tokens that are actually correct. High precision means few spurious splits.

Recall is the fraction of true tokens that the tokenizer found. High recall means few missed splits.

Example: A tokenizer that splits too aggressively lowers precision (more spurious tokens); one that splits too little lowers recall (more missed tokens).

Good tokenization balances precision and recall to avoid both missing and adding wrong tokens.
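The tradeoff can be made concrete with the standard formulas, precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts below are invented to illustrate an over-splitter and an under-splitter against the same 10 gold tokens:

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Over-splitter: predicts 14 tokens, 9 correct, 1 gold token missed
p, r = precision_recall(tp=9, fp=5, fn=1)
print(f"over-splitter:  precision={p:.2f} recall={r:.2f}")   # 0.64 / 0.90

# Under-splitter: predicts 7 tokens, 6 correct, 4 gold tokens missed
p, r = precision_recall(tp=6, fp=1, fn=4)
print(f"under-splitter: precision={p:.2f} recall={r:.2f}")   # 0.86 / 0.60
```

Each failure mode hurts a different metric, which is why both must be watched at once.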

Good vs Bad Metric Values for Tokenization
  • Good: Precision and Recall above 95%. Tokenization matches human standard closely.
  • Bad: Precision or Recall below 80%. Many tokens are wrong or missed, causing errors in later text analysis.
  • Example: Precision 98%, Recall 97% means tokenizer is very reliable.
  • Example: Precision 70%, Recall 60% means tokenizer often splits wrongly or misses tokens.
Common Pitfalls in Tokenization Metrics
  • Ignoring context: Some tokens depend on language rules or abbreviations. Simple metrics may miss these nuances.
  • Data leakage: Evaluating the tokenizer on the same data its rules were tuned on gives overly optimistic accuracy.
  • Overfitting: Tokenizer tuned too much on one text type may fail on others.
  • Accuracy paradox: High overall accuracy can hide poor token splits if many tokens are easy.
Self Check

Your tokenizer has 98% accuracy but 12% recall on splitting contractions like "don't". Is it good?

Answer: No. Even with high overall accuracy, very low recall on contractions means many tokens are missed. This hurts understanding and downstream tasks. You should improve recall on these cases.
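The arithmetic behind this accuracy paradox is easy to reproduce; the counts below are made up to match the scenario (98% overall, 12% on contractions):

```python
# Overall accuracy looks fine because most tokens are easy
# (whitespace-separated words), but the rare contraction splits
# are almost all missed.
total_tokens = 10_000
correct_tokens = 9_800            # 98% of all token decisions correct

contraction_tokens = 400          # gold tokens from splits like "do" + "n't"
contractions_found = 48           # only 48 of them recovered

overall_accuracy = correct_tokens / total_tokens
contraction_recall = contractions_found / contraction_tokens

print(f"overall accuracy:   {overall_accuracy:.0%}")    # 98%
print(f"contraction recall: {contraction_recall:.0%}")  # 12%
```

This is why per-category recall should be reported alongside overall accuracy.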

Key Result
Tokenization accuracy depends on balancing precision and recall to correctly split text into tokens without missing or adding wrong pieces.