
Handling out-of-vocabulary words in NLP - Model Metrics & Evaluation

Which metric matters for handling out-of-vocabulary words, and why

When dealing with words not seen during training, the key metric is coverage: the fraction of tokens in new data that the model's vocabulary recognizes. Low coverage means many tokens are unknown, which can hurt predictions.
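As a quick sketch (the vocabulary and token list below are made up for illustration), coverage is simply the fraction of tokens found in the model's vocabulary:

```python
def vocabulary_coverage(tokens, vocab):
    """Fraction of tokens that appear in the model vocabulary."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vocab)
    return known / len(tokens)

vocab = {"the", "cat", "sat", "on", "mat"}
tokens = ["the", "dog", "sat", "on", "the", "rug"]

# "dog" and "rug" are OOV, so coverage is 4/6 here.
print(vocabulary_coverage(tokens, vocab))
```

In practice you would compute this over the tokenized test set, not a single sentence.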

Besides coverage, task-level metrics such as accuracy or F1 score on text classification or named entity recognition indirectly show how well the model handles unknown words.

We also look at embedding quality for unknown words, often measured by downstream task performance or by similarity scores against related in-vocabulary words.
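One common way to give unknown words an embedding at all is subword composition, in the style of FastText. A minimal sketch, using a toy hand-made n-gram table (a real model would learn these vectors during training):

```python
# Toy character trigram vectors (assumption: learned elsewhere).
NGRAM_VECS = {
    "<un": [0.1, 0.2], "unk": [0.0, 0.4], "nkn": [0.3, 0.1],
    "kno": [0.2, 0.2], "now": [0.1, 0.0], "own": [0.4, 0.3], "wn>": [0.2, 0.1],
}
DIM = 2

def char_ngrams(word, n=3):
    """Character n-grams of the word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def oov_embedding(word):
    """Average the vectors of the word's known character n-grams."""
    vecs = [NGRAM_VECS[ng] for ng in char_ngrams(word) if ng in NGRAM_VECS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

vec = oov_embedding("unknown")  # built entirely from subword pieces
```

The quality of such composed embeddings is what the similarity scores and downstream metrics above are meant to assess.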

Confusion matrix example for OOV handling

Imagine a text classification task where unknown words cause errors. Here is a confusion matrix for 100 samples:

      |                  | Predicted Positive       | Predicted Negative       |
      |------------------|--------------------------|--------------------------|
      | Actual Positive  | True Positive (TP) = 40  | False Negative (FN) = 10 |
      | Actual Negative  | False Positive (FP) = 5  | True Negative (TN) = 45  |

Precision = 40 / (40 + 5) ≈ 0.89

Recall = 40 / (40 + 10) = 0.80

F1 Score = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84

If many errors come from unknown words, improving OOV handling should raise these numbers.
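These figures can be reproduced directly from the confusion-matrix counts:

```python
# Counts from the confusion matrix above.
tp, fn, fp, tn = 40, 10, 5, 45

precision = tp / (tp + fp)                          # 40 / 45
recall = tp / (tp + fn)                             # 40 / 50
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```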

Tradeoff: Precision vs Recall in OOV word handling

When unknown words appear, the model might guess their meaning or ignore them.

High precision means the model is careful and only predicts positive when confident, reducing false alarms caused by unknown words.

High recall means the model tries to catch all positives, even if some unknown words cause mistakes.

For example, in spam detection, if unknown words appear, high precision avoids marking good emails as spam.

In medical text, high recall is better to catch all important mentions, even if some unknown words cause false alarms.
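The tradeoff is usually controlled by the decision threshold. A sketch with made-up scores and labels shows how raising the threshold buys precision at the cost of recall:

```python
# Hypothetical classifier scores and true labels for eight samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]

def precision_recall(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# High threshold: cautious, spam-detection style (precision 1.0, recall 0.5).
print(precision_recall(0.75))
# Low threshold: catch-everything, medical-text style (precision 0.8, recall 1.0).
print(precision_recall(0.35))
```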

Good vs Bad metric values for OOV handling

Good: Coverage above 95%, F1 score above 0.85 on test data with unknown words, low error rate on sentences containing OOV words.

Bad: Coverage below 80%, F1 score below 0.70, many misclassifications linked to unknown words, showing the model struggles to understand new words.

Common pitfalls in metrics for OOV handling
  • Ignoring coverage: High accuracy can hide poor handling of unknown words if test data has few OOV words.
  • Data leakage: If test data contains words seen in training, OOV impact is underestimated.
  • Overfitting: Model memorizes training words but fails on new words, causing poor generalization.
  • Confusing precision and recall: Misinterpreting which errors matter more depending on application.
Self-check question

Your text classifier has 98% accuracy but only 12% recall on sentences containing unknown words. Is it good for production?

Answer: No. The low recall means the model misses most positive cases when unknown words appear. Despite high overall accuracy, it fails to handle OOV words well, which can cause serious errors in real use.
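The reliable way to catch this before production is to slice evaluation by whether a sentence contains OOV words. A sketch with hypothetical per-sentence records:

```python
# Hypothetical records: (sentence_has_oov, true_label, predicted_label).
results = [
    (False, 1, 1), (False, 0, 0), (False, 1, 1), (False, 0, 0),
    (True,  1, 0), (True,  1, 0), (True,  1, 1), (True,  0, 0),
]

def recall_for(subset):
    tp = sum(1 for _, y, p in subset if y == 1 and p == 1)
    fn = sum(1 for _, y, p in subset if y == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0

oov = [r for r in results if r[0]]
in_vocab = [r for r in results if not r[0]]

# Overall numbers can look fine while the OOV slice is much worse.
print(recall_for(in_vocab), recall_for(oov))
```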

Key Result
Coverage and F1 score are key to evaluating how well a model handles out-of-vocabulary words.