For tokenization and vocabulary, the key metrics are token coverage and out-of-vocabulary (OOV) rate. Token coverage measures how well the vocabulary represents the input text: high coverage means most words or subwords are recognized by the model. The OOV rate shows how many words fall outside the vocabulary, which can cause the model to struggle to understand or generate text. These metrics matter because good tokenization breaks text into meaningful pieces the model already knows, helping it learn and predict better.
Tokenization and vocabulary in Prompt Engineering / GenAI - Model Metrics & Evaluation
A quick worked example (vocabulary size: 10,000 tokens):
- Total words in text: 1,000
- Known tokens (in vocabulary): 950
- Unknown tokens (OOV): 50
- Token Coverage = 950 / 1,000 = 95%
- OOV Rate = 50 / 1,000 = 5%
This simple count shows how many tokens the model can handle well versus how many are unknown.
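The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a real tokenizer: the vocabulary and text are synthetic placeholders sized to mirror the worked example (950 of 1,000 words known).

```python
# Minimal sketch: computing token coverage and OOV rate for a toy corpus.
# The vocabulary and word list below are synthetic, not from a real tokenizer.

def coverage_and_oov(words, vocabulary):
    """Return (token_coverage, oov_rate) as fractions of total words."""
    known = sum(1 for w in words if w in vocabulary)
    total = len(words)
    return known / total, (total - known) / total

# Toy setup mirroring the worked example: 1,000 words, 950 in vocabulary.
vocab = {f"word{i}" for i in range(950)}
text = [f"word{i}" for i in range(1000)]  # word950..word999 are OOV

coverage, oov = coverage_and_oov(text, vocab)
print(f"Token coverage: {coverage:.0%}")  # 95%
print(f"OOV rate: {oov:.0%}")             # 5%
```

In practice the same loop runs over tokenizer output rather than whitespace-split words, but the ratio is computed identically.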
In tokenization, think of precision as how accurately tokens represent real words or meaningful parts, and recall as how many real words are captured by the vocabulary.
Example 1: High precision, low recall
The vocabulary contains only very specific tokens, so each token is highly meaningful (high precision). But many words are missing, so much of the input falls out of vocabulary (low recall). This can confuse the model on new text.
Example 2: High recall, low precision
The vocabulary includes many tokens, even rare or noisy ones. Most words are covered (high recall), but many tokens are tiny or carry little meaning on their own (low precision). This can slow the model down and make its representations less clear.
The goal is to balance token coverage (recall) and meaningful tokens (precision) for best model understanding.
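The two extremes above can be made concrete by tokenizing the same sentence two ways. This is an illustrative sketch with made-up data: the word-level vocabulary is deliberately missing one word, while the character-level split covers everything at the cost of meaningless single-character tokens.

```python
# Illustrative sketch of the precision/recall trade-off using two extreme
# tokenizers over the same sentence. Vocabulary and sentence are made up.

sentence = "tokenization helps models"
word_vocab = {"helps", "models"}  # word-level vocab missing "tokenization"

# Word-level: each token is a meaningful word (high precision), but
# "tokenization" never made it into the vocabulary, so it is OOV (lower recall).
word_tokens = sentence.split()
word_oov = [t for t in word_tokens if t not in word_vocab]

# Character-level: every character is trivially covered (high recall), but a
# single character carries almost no meaning on its own (low precision).
char_tokens = list(sentence.replace(" ", ""))

print("word tokens:", word_tokens, "OOV:", word_oov)
print("char tokens:", len(char_tokens), "tokens, 0 OOV")
```

Subword tokenizers (BPE, WordPiece, and similar) sit between these extremes, which is why they dominate in practice.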
- Good: Token coverage above 95%, OOV rate below 5%. Vocabulary size balanced to cover most words without too many rare tokens.
- Bad: Token coverage below 80%, OOV rate above 20%. Many unknown tokens cause poor model understanding and errors.
- Too large vocabulary can slow training and increase memory use without big gains.
- Too small vocabulary leads to many unknown tokens and poor text representation.
- Ignoring OOV rate: High accuracy on training data can hide many unknown tokens in new text, causing poor real-world performance.
- Overfitting vocabulary: Vocabulary too tuned to training data may not generalize to new words or languages.
- Data leakage: Including test data words in vocabulary inflates coverage and misleads evaluation.
- Ignoring token granularity: Very small tokens (like single letters) increase coverage but reduce meaningfulness.
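The data-leakage pitfall above has a simple mechanical fix: build the vocabulary strictly from training text and measure OOV rate on a held-out set. The snippet below is a toy sketch with invented word lists, not a real evaluation pipeline.

```python
# Sketch of avoiding the data-leakage pitfall: the vocabulary is built from
# training words only, and the OOV rate is measured on held-out words.
# Both word lists are toy data for illustration.

train_words = ["the", "cat", "sat", "on", "the", "mat"]
test_words = ["the", "dog", "sat", "on", "a", "log"]

vocab = set(train_words)  # built strictly from training data, never from test

oov_rate = sum(1 for w in test_words if w not in vocab) / len(test_words)
print(f"Held-out OOV rate: {oov_rate:.0%}")
```

If the vocabulary had been built from train and test words together, the measured OOV rate would drop to zero and hide exactly the generalization gap this metric is supposed to reveal.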
Your tokenizer has 98% token coverage but a vocabulary size of 100,000 tokens. Is this good? Why or why not?
Answer: While 98% coverage is high, a 100,000-token vocabulary is very large: it slows down the model, increases memory use, and likely includes many rare or unnecessary tokens. A smaller vocabulary with slightly lower coverage (e.g., 95%) could be more efficient while still effective. So this setup might not be ideal for practical use.