Tokenization in spaCy in NLP - Model Metrics & Evaluation

Tokenization splits text into words or subword pieces. The key metric is tokenization accuracy: the proportion of tokens the model splits correctly compared with a trusted gold standard. High accuracy means the text is split correctly, which helps downstream steps such as understanding meaning or extracting keywords.
|                  | Predicted Token  | Predicted No Token |
|------------------|------------------|--------------------|
| Actual Token     | True Token (TP)  | Missed Token (FN)  |
| Actual No Token  | False Token (FP) | True No Token (TN) |
- TP: Tokens the tokenizer identified correctly
- FP: Tokens the tokenizer added incorrectly (spurious splits)
- FN: True tokens the tokenizer missed
- TN: Positions correctly left unsplit
Example: spaCy's English tokenizer splits "don't" into "do" and "n't"; each correct split counts as a TP. If the tokenizer leaves "don't" whole, the missed splits count as FNs.
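The TP/FP/FN counts above can be sketched by comparing predicted token spans (character offsets) against a gold-standard segmentation. The spans below are illustrative, not from a real corpus:

```python
def span_counts(predicted, gold):
    """Return (tp, fp, fn) for two lists of (start, end) token spans."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)   # spans the tokenizer got exactly right
    fp = len(pred - ref)   # spans predicted but not in the gold standard
    fn = len(ref - pred)   # gold spans the tokenizer missed
    return tp, fp, fn

# "don't" split correctly into "do" + "n't": both spans match the gold.
gold = [(0, 2), (2, 5)]          # "do", "n't"
correct = [(0, 2), (2, 5)]
print(span_counts(correct, gold))   # (2, 0, 0): two TPs

# A tokenizer that leaves "don't" whole: one FP, two FNs.
unsplit = [(0, 5)]
print(span_counts(unsplit, gold))   # (0, 1, 2)
```

Matching on exact character offsets is the usual convention, since a token that overlaps the gold span but has different boundaries is still a wrong split.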
Precision is the fraction of predicted tokens that are actually correct: TP / (TP + FP). High precision means few spurious splits.
Recall is the fraction of true tokens the tokenizer found: TP / (TP + FN). High recall means few missed splits.
Example: a tokenizer that over-splits produces more false positives, so precision drops; one that under-splits misses true tokens, so recall drops.
Good tokenization balances precision and recall to avoid both missing and adding wrong tokens.
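The precision/recall trade-off described above can be computed directly from the confusion counts. The counts passed in below are made up for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Token-level precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Over-splitting: extra tokens raise FP, so precision drops to 0.75.
print(precision_recall_f1(tp=90, fp=30, fn=5))

# Under-splitting: missed tokens raise FN, so recall drops instead.
print(precision_recall_f1(tp=90, fp=5, fn=30))
```

F1, the harmonic mean of precision and recall, is the single number usually reported, since it penalizes a tokenizer that sacrifices one metric for the other.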
- Good: Precision and Recall above 95%. Tokenization matches human standard closely.
- Bad: Precision or Recall below 80%. Many tokens are wrong or missed, causing errors in later text analysis.
- Example: Precision 98%, Recall 97% means tokenizer is very reliable.
- Example: Precision 70%, Recall 60% means tokenizer often splits wrongly or misses tokens.
- Ignoring context: Some tokens depend on language rules or abbreviations. Simple metrics may miss these nuances.
- Data leakage: Testing the tokenizer on data it was trained on yields overly optimistic accuracy.
- Overfitting: Tokenizer tuned too much on one text type may fail on others.
- Accuracy paradox: High overall accuracy can hide poor token splits if many tokens are easy.
Your tokenizer has 98% accuracy but 12% recall on splitting contractions like "don't". Is it good?
Answer: No. Even with high overall accuracy, very low recall on contractions means many tokens are missed. This hurts understanding and downstream tasks. You should improve recall on these cases.