When controlling vocabulary size in NLP models, the key metrics are model accuracy and out-of-vocabulary (OOV) rate. Accuracy measures how well the model performs on text given the chosen vocabulary. OOV rate is the fraction of words in new text that are missing from the vocabulary. A smaller vocabulary reduces model size and speeds up training but can raise the OOV rate, hurting accuracy. So, balancing these metrics helps find the best vocabulary size.
Vocabulary size control in NLP - Model Metrics & Evaluation
Suppose we classify text into positive or negative sentiment.
Vocabulary size: 5,000 words
Total samples: 100
Confusion Matrix:
                | Predicted Positive | Predicted Negative
----------------|--------------------|-------------------
Actual Positive | 40 (TP)            | 10 (FN)
Actual Negative | 5 (FP)             | 45 (TN)
Precision = TP / (TP + FP) = 40 / (40 + 5) = 0.89
Recall = TP / (TP + FN) = 40 / (40 + 10) = 0.80
F1 Score = 2 * (0.89 * 0.80) / (0.89 + 0.80) = 0.84
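The calculations above can be checked with a small helper that computes all three metrics directly from the confusion-matrix counts:

```python
# Compute precision, recall, and F1 from raw confusion-matrix counts.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the 5,000-word-vocabulary confusion matrix above.
p, r, f1 = prf1(tp=40, fp=5, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.89 0.8 0.84
```

Note that F1 is computed here from the unrounded precision and recall; the result still rounds to 0.84.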
If vocabulary size shrinks to 2,000, OOV words increase, causing more errors:
Confusion Matrix:
                | Predicted Positive | Predicted Negative
----------------|--------------------|-------------------
Actual Positive | 30 (TP)            | 20 (FN)
Actual Negative | 10 (FP)            | 40 (TN)
Precision = 30 / (30 + 10) = 0.75
Recall = 30 / (30 + 20) = 0.60
F1 Score = 2 * (0.75 * 0.60) / (0.75 + 0.60) = 0.67
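The mechanism behind this drop is the OOV rate: fewer vocabulary words means more unknown tokens in new text. A minimal sketch, using a toy sentence and toy vocabularies standing in for the 5,000- and 2,000-word ones above:

```python
# OOV rate = fraction of tokens in new text that are not in the vocabulary.
def oov_rate(tokens, vocab):
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)

text = "the movie was absolutely riveting and superb".split()
large_vocab = {"the", "movie", "was", "absolutely", "riveting", "and", "superb"}
small_vocab = {"the", "movie", "was", "and"}  # shrunken vocabulary

print(oov_rate(text, large_vocab))  # 0.0
print(oov_rate(text, small_vocab))  # ≈ 0.43 — three of seven tokens are unknown
```

Every unknown token is information the classifier cannot use, which is what drives precision and recall down in the second confusion matrix.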
Imagine packing a suitcase for a trip. A big suitcase (large vocabulary) lets you bring many clothes (words), so you are ready for anything (better accuracy). But it is heavy and slow to carry (larger model, slower training).
A small suitcase (small vocabulary) is light and fast but may miss important clothes (words), so you might feel unprepared (higher OOV, lower accuracy).
In NLP, choosing vocabulary size balances model speed and memory against understanding new text well.
- Good: Low OOV rate (under 5%), high accuracy (above 85%), balanced precision and recall.
- Bad: High OOV rate (above 15%), low accuracy (below 70%), or a large gap between precision and recall, indicating the model's errors are skewed toward one class.
Good values mean the vocabulary covers most words the model sees, helping it predict well. Bad values mean many words are unknown, causing mistakes.
- Ignoring OOV rate: High accuracy on training data can hide poor performance on new text with many unknown words.
- Overfitting vocabulary: Using too large a vocabulary may memorize rare training words but fail to generalize to new words.
- Data leakage: Including test words in vocabulary inflates accuracy falsely.
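To avoid that leakage, the vocabulary should be built from training text only, and OOV measured on held-out text. A minimal sketch (the word lists and `max_size` value are illustrative):

```python
# Build the vocabulary from training tokens only, then measure OOV on held-out text.
# Including test tokens in the vocabulary would hide the real OOV rate.
from collections import Counter

def build_vocab(train_tokens, max_size):
    counts = Counter(train_tokens)
    return {word for word, _ in counts.most_common(max_size)}

train = "good film great film bad plot".split()
test = "great acting bad pacing".split()

vocab = build_vocab(train, max_size=4)  # built from training data only
oov = sum(1 for t in test if t not in vocab) / len(test)
print(oov)  # 0.5 — "acting" and "pacing" never appeared in training
```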
- Accuracy paradox: High accuracy with a small vocabulary may occur when the data is unbalanced, even though the model misses rare words.
Your NLP model has 98% accuracy but a 20% OOV rate on new text. Is it good for production? Why or why not?
Answer: No, because a 20% OOV rate means many words are unknown to the model. Even with high accuracy on known words, the model will struggle with new or rare words, reducing real-world performance. You should reduce OOV by increasing vocabulary or using subword methods.
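The subword idea mentioned in the answer can be illustrated with a minimal sketch: when a whole word is out of vocabulary, fall back to pieces of it that are in vocabulary. This greedy longest-match split is only a toy; real systems such as BPE and WordPiece learn their merge rules from data, and the vocabulary here is invented for the example:

```python
# Toy subword fallback: split an OOV word into known pieces instead of one <unk>.
def subword_tokenize(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        # Greedy longest-match-first: try the longest remaining substring.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no piece matched this character
            i += 1
    return pieces

vocab = {"un", "break", "able"}
print(subword_tokenize("unbreakable", vocab))  # ['un', 'break', 'able']
```

Even though "unbreakable" is OOV as a whole word, the model still receives meaningful pieces, which is why subword methods keep the effective OOV rate near zero without a huge word-level vocabulary.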