
Tokenization (word and sentence) in NLP - Model Metrics & Evaluation

Which metric matters for Tokenization and WHY

Tokenization breaks text into pieces such as words or sentences. The key metric is tokenization accuracy: how many tokens are correctly identified compared with a trusted reference segmentation. This matters because wrong tokens propagate errors into later steps such as parsing, understanding, or translation.
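As a concrete sketch, word and sentence tokenization can be done with simple regular expressions. This is a naive illustration, not a production tokenizer; the text and patterns are made up for the example:

```python
import re

text = "Tokenization splits text. It feeds later NLP steps."

# Naive sentence tokenization: split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word tokenization: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)  # ['Tokenization splits text.', 'It feeds later NLP steps.']
print(words)      # ['Tokenization', 'splits', 'text', '.', 'It', ...]
```

Real tokenizers handle abbreviations, numbers, and language-specific rules that these patterns ignore, which is exactly why their output must be measured against a reference.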

Confusion matrix for Tokenization
                    | Predicted Token | Predicted No Token
    ------------------------------------------------------
    Actual Token    |       TP        |         FN
    Actual No Token |       FP        |         TN

    TP = correctly found tokens
    FP = wrongly added tokens
    FN = missed tokens
    TN = correctly ignored non-tokens
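From these counts, precision, recall, and F1 follow directly. A minimal sketch, using hypothetical counts rather than results from a real evaluation:

```python
# Hypothetical confusion-matrix counts for a tokenizer run.
tp, fp, fn = 90, 5, 10

precision = tp / (tp + fp)  # fraction of predicted tokens that are correct
recall = tp / (tp + fn)     # fraction of true tokens that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that TN does not appear in precision, recall, or F1; this is what later makes these metrics more informative than plain accuracy when non-tokens vastly outnumber tokens.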
    
Tradeoff: Precision vs Recall in Tokenization

Precision means how many predicted tokens are actually correct. High precision means fewer wrong splits.

Recall means how many real tokens were found. High recall means fewer missed tokens.

For example, in sentence tokenization, missing a sentence end lowers recall. Adding extra breaks lowers precision.

Good tokenization balances both. Too many splits (high recall, low precision) confuse meaning. Too few splits (high precision, low recall) lose details.
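The sentence-boundary case above can be sketched by comparing predicted and reference boundary positions as sets. The character offsets here are hypothetical, chosen to show one extra break lowering precision while recall stays perfect:

```python
# True and predicted sentence-end positions (hypothetical character offsets).
reference = {17, 42, 80}
predicted = {17, 42, 60, 80}  # one spurious break at offset 60

tp = len(reference & predicted)  # boundaries found correctly
fp = len(predicted - reference)  # extra breaks -> lower precision
fn = len(reference - predicted)  # missed sentence ends -> lower recall

precision = tp / (tp + fp)  # 3 of 4 predictions correct
recall = tp / (tp + fn)     # all 3 true boundaries found
```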

Good vs Bad Metric Values for Tokenization
  • Good: Precision and recall above 95% mean tokens match the true text parts well.
  • Bad: Precision below 70% means many wrong tokens are added.
  • Bad: Recall below 70% means many tokens are missed.
  • F1 score near 1.0 means balanced and accurate tokenization.
Common Pitfalls in Tokenization Metrics
  • Ignoring punctuation: Some tokenizers split on punctuation differently, causing metric mismatches.
  • Data leakage: Testing on data used for tokenizer tuning inflates scores.
  • Overfitting: Tokenizer tuned too much on one text style may fail on others.
  • Accuracy paradox: Overall accuracy can be misleading if many tokens are easy to find but some critical tokens are missed.
Self Check

Your tokenizer has 98% accuracy but 12% recall on sentence breaks. Is it good?

Answer: No. The low recall means it misses most sentence boundaries. This harms tasks needing sentence understanding, even if overall accuracy seems high.
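This accuracy paradox is easy to reproduce with made-up counts. Assuming 5,000 candidate boundary positions of which only 50 are true sentence ends (all numbers hypothetical, chosen to match the 98% / 12% scenario):

```python
# Hypothetical counts: almost all candidate positions are non-boundaries,
# so accuracy stays high even though most sentence ends are missed.
tp, fn, fp, tn = 6, 44, 56, 4894

total = tp + fn + fp + tn
accuracy = (tp + tn) / total  # dominated by the many easy true negatives
recall = tp / (tp + fn)       # only 6 of 50 true boundaries found

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The abundant true negatives inflate accuracy to 98% while recall sits at 12%, which is why boundary-level precision and recall, not overall accuracy, are the right lens.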

Key Result
Tokenization quality is best judged by balanced precision and recall above 95%, ensuring tokens match true text parts accurately.