Sometimes a model encounters words it never saw during training. Handling these out-of-vocabulary (OOV) words helps the model cope with new or rare words instead of failing on them.
Handling out-of-vocabulary words in NLP
Introduction
When a chatbot meets new slang or names it hasn't seen before.
When translating text with rare or new words.
When analyzing social media posts with typos or new terms.
When building a search engine that must understand new queries.
When training models on limited data but expecting new words later.
Syntax
1. Use a special token like <UNK> for unknown words.
2. Replace words not in the vocabulary with <UNK> during preprocessing.
3. Use subword tokenization methods like Byte Pair Encoding (BPE) or WordPiece.
4. Use character-level models that read words letter by letter.
The <UNK> token stands for 'unknown' and helps the model handle unseen words.
Subword methods break words into smaller parts, so even new words can be understood from known pieces.
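To make this concrete, here is a minimal sketch of the subword idea. It is a toy greedy longest-match splitter (similar in spirit to how WordPiece segments words), not a real tokenizer; the subword set and the function name split_into_subwords are made up for illustration.

```python
# Toy subword splitting: greedily match the longest known piece at each
# position, so unseen words are built from familiar fragments.
subwords = {"quo", "kk", "a", "un", "break", "able", "q", "u", "o", "k"}

def split_into_subwords(word, subwords):
    """Greedy longest-match segmentation into known subwords."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<UNK>")  # no known piece covers this character
            i += 1
    return pieces

print(split_into_subwords("quokka", subwords))       # ['quo', 'kk', 'a']
print(split_into_subwords("unbreakable", subwords))  # ['un', 'break', 'able']
```

Even though "quokka" was never seen as a whole word, it is fully covered by known pieces, so no information is lost to <UNK>.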
Examples
This replaces the unknown word 'quokka' with <UNK>.
sentence = "I love quokka"
vocab = {"I", "love"}
processed = [word if word in vocab else "<UNK>" for word in sentence.split()]
Subword tokenizers help break new words into smaller known parts.
from tokenizers import ByteLevelBPETokenizer
# Train or load a BPE tokenizer
# It splits unknown words into known subwords
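To show what BPE training actually does without depending on the tokenizers library, here is a self-contained toy sketch. It learns merge rules by repeatedly fusing the most frequent adjacent symbol pair; the function name learn_bpe_merges and the tiny corpus are illustrative assumptions, not a real library API.

```python
from collections import Counter

# Toy BPE training: start from characters and repeatedly merge the most
# frequent adjacent pair of symbols into a new, longer symbol.
def learn_bpe_merges(words, num_merges):
    corpus = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = learn_bpe_merges(["low", "lower", "lowest", "low"], num_merges=2)
print(merges)  # frequent pairs such as ('l', 'o') get merged first
```

Real BPE implementations work the same way at a much larger scale, which is why rare words end up segmented into frequent, already-known subwords.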
Character-level tokenization avoids unknown words by reading letters.
def char_level_tokenize(word):
    return list(word)
# Model reads each letter, so unknown words are handled naturally
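Character-level models then map each letter to an integer ID. The sketch below assumes a lowercase English alphabet and a hypothetical encode_word helper; any word built from those letters can be encoded, so no word-level <UNK> is ever needed.

```python
# Map each character to an integer ID; any word over this alphabet is
# representable, so unseen words never become <UNK>.
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_to_id = {ch: i for i, ch in enumerate(alphabet)}

def encode_word(word):
    return [char_to_id[ch] for ch in word.lower() if ch in char_to_id]

print(encode_word("quokka"))  # [16, 20, 14, 10, 10, 0]
```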
Sample Model
This simple program shows how to replace words not in the vocabulary with <UNK> so the model can handle them.
vocab = {"hello", "world", "I", "love"}
sentence = "I love quokka"
# Replace unknown words with <UNK>
processed = [word if word in vocab else "<UNK>" for word in sentence.split()]
print("Original sentence:", sentence)
print("Processed sentence:", " ".join(processed))
Important Notes
Always include the <UNK> token in your vocabulary when training models.
Subword tokenization is more flexible and often better than just using <UNK>.
Character-level models can be slower but handle any word without unknown tokens.
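One common way to make sure <UNK> ends up in the vocabulary is to build the vocabulary from training data with a minimum-frequency cutoff, so rare words are mapped to <UNK> and the model learns an embedding for it. The build_vocab helper and the tiny corpus below are illustrative assumptions, a minimal sketch rather than a production pipeline.

```python
from collections import Counter

# Build a vocabulary from training text: keep words seen at least
# `min_freq` times and reserve <UNK> for everything else.
def build_vocab(corpus, min_freq=2):
    counts = Counter(word for sentence in corpus for word in sentence.split())
    vocab = {"<UNK>"}  # reserve the unknown token up front
    vocab.update(word for word, c in counts.items() if c >= min_freq)
    return vocab

corpus = ["I love NLP", "I love models", "quokka pictures"]
vocab = build_vocab(corpus, min_freq=2)
print(vocab)  # rare words like "quokka" are excluded; "<UNK>" is included
```

Deliberately mapping some rare training words to <UNK> also means the model sees the token during training, instead of meeting it for the first time at inference.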
Summary
Out-of-vocabulary words are words the model hasn't seen before.
Use <UNK> tokens or subword methods to handle them.
This helps models work better with new or rare words.