Sometimes a model encounters words that are not in its training vocabulary; these are called out-of-vocabulary (OOV) words. Handling them explicitly lets the model process new or rare words instead of failing on them.
Handling out-of-vocabulary words in NLP
Introduction
Syntax
NLP
1. Use a special token like <UNK> for unknown words.
2. Replace words not in vocabulary with <UNK> during preprocessing.
3. Use subword tokenization methods like Byte Pair Encoding (BPE) or WordPiece.
4. Use character-level models that read words letter by letter.
The <UNK> token stands for 'unknown' and helps the model handle unseen words.
Subword methods break words into smaller parts, so even new words can be understood from known pieces.
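The first two steps in the list above can be sketched in plain Python. This is a minimal sketch, assuming whitespace tokenization and a frequency cutoff; the helper names `build_vocab` and `encode` and the `min_count` parameter are illustrative and do not come from any library.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(sentences, min_count=2):
    """Map each sufficiently frequent word to an integer id; id 0 is reserved for <UNK>."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {UNK: 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to ids, falling back to the <UNK> id for unseen words."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

train = ["I love cats", "I love dogs"]  # toy training corpus (assumption)
vocab = build_vocab(train)
ids = encode("I love quokka", vocab)
print(vocab)  # {'<UNK>': 0, 'I': 1, 'love': 2}
print(ids)    # [1, 2, 0]
```

With a cutoff like this, rare training words are also mapped to <UNK>, so the model sees the token during training and not only at inference time.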
Examples
NLP
sentence = "I love quokka"
vocab = {"I", "love"}
processed = [word if word in vocab else "<UNK>" for word in sentence.split()]
NLP
from tokenizers import ByteLevelBPETokenizer
# Train or load a BPE tokenizer
# It splits unknown words into known subwords
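To make the splitting step concrete without installing a tokenizer library, here is a toy sketch in plain Python. It assumes a hand-picked subword vocabulary and uses greedy longest-match splitting with a `##` prefix for continuation pieces, in the style of WordPiece. Real BPE and WordPiece tokenizers learn their vocabulary from a corpus; the function `subword_tokenize` and the vocabulary below are hypothetical.

```python
def subword_tokenize(word, vocab, unk="<UNK>"):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece fits here, so the whole word falls back to <UNK>.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hand-picked toy vocabulary (assumption, not learned from data).
vocab = {"qu", "love", "##ok", "##ka", "##k", "##a"}
print(subword_tokenize("quokka", vocab))  # ['qu', '##ok', '##ka']
print(subword_tokenize("xyz", vocab))     # ['<UNK>']
```

The word "quokka" never appears in the vocabulary as a whole, yet it is still represented by three known pieces.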
NLP
def char_level_tokenize(word):
    return list(word)
# Model reads each letter, so unknown words are handled naturally
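A character-level model still needs each character mapped to an id. The sketch below assumes a hypothetical character vocabulary of lowercase ASCII letters, with id 0 kept as a fallback for characters outside that set; `encode_chars` is an illustrative name.

```python
import string

# Hypothetical character vocabulary: 'a' -> 1, ..., 'z' -> 26; id 0 is the fallback.
char_to_id = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}

def encode_chars(word):
    """Encode a word as a list of character ids, using 0 for unknown characters."""
    return [char_to_id.get(ch, 0) for ch in word.lower()]

print(encode_chars("quokka"))  # [17, 21, 15, 11, 11, 1]
```

An unseen word made of known letters encodes without any unknown id, although a character outside the character vocabulary would still need the fallback.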
Sample Model
This simple program shows how to replace words not in the vocabulary with <UNK> so the model can handle them.
NLP
vocab = {"hello", "world", "I", "love"}
sentence = "I love quokka"
# Replace unknown words with <UNK>
processed = [word if word in vocab else "<UNK>" for word in sentence.split()]
print("Original sentence:", sentence)
print("Processed sentence:", " ".join(processed))
Important Notes
Always include the <UNK> token in your vocabulary when training models.
Subword tokenization is more flexible and often better than just using <UNK>.
Character-level models can be slower because their input sequences are much longer, but they can represent any word built from known characters without unknown tokens.
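One likely reason for the speed trade-off in the last note is sequence length. The quick check below, a sketch using whitespace splitting as the word-level tokenizer, compares how many tokens the same sentence produces at each level.

```python
sentence = "I love quokka"

word_tokens = sentence.split()  # word-level tokens
char_tokens = list(sentence)    # character-level tokens, spaces included

print(len(word_tokens), len(char_tokens))  # 3 13
```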
Summary
Out-of-vocabulary words are words the model hasn't seen before.
Use <UNK> tokens or subword methods to handle them.
This helps models work better with new or rare words.
Practice
1. What is the main purpose of using an <UNK> token in natural language processing? (easy)
Solution
Step 1: Understand the role of the <UNK> token
The <UNK> token is used to replace words that the model has not seen during training, known as out-of-vocabulary words.
Step 2: Identify the correct purpose
Since <UNK> stands for unknown words, it helps the model handle new or rare words by treating them as a single token.
Final Answer:
To represent words not seen during training -> Option C
Quick Check:
<UNK> = unknown words [OK]
Hint: Think of <UNK> as a placeholder for unknown words
Common Mistakes:
- Confusing <UNK> with sentence delimiters
- Using <UNK> for common words
- Thinking <UNK> highlights keywords
2. Which of the following is the correct way to replace out-of-vocabulary words with <UNK> in a Python list of tokens named tokens, given a vocabulary set vocab? (easy)
Solution
Step 1: Understand list comprehension syntax
The correct Python syntax for conditional expressions inside a list comprehension is: [x if condition else y for x in list].
Step 2: Apply correct condition for replacing OOV words
We want to keep the word if it is in the vocabulary; otherwise, replace it with '<UNK>'. The statement tokens = [word if word in vocab else '<UNK>' for word in tokens] correctly implements this logic.
Final Answer:
tokens = [word if word in vocab else '<UNK>' for word in tokens] -> Option A
Quick Check:
Correct Python conditional list comprehension [OK]
Hint: Remember: x if condition else y inside list comprehensions
Common Mistakes:
- Using incorrect syntax like 'if-else' outside list comprehension
- Confusing Python with other languages' ternary syntax
- Reversing the condition logic
3. Given the following code snippet, what will be the output? (medium)
vocab = {'hello', 'world'}
tokens = ['hello', 'there', 'world']
tokens = [word if word in vocab else '<UNK>' for word in tokens]
print(tokens)
Solution
Step 1: Check each token against the vocabulary
'hello' is in vocab, so it stays 'hello'. 'there' is not in vocab, so it becomes '<UNK>'. 'world' is in vocab, so it stays 'world'.
Step 2: Construct the resulting list
The new tokens list is ['hello', '<UNK>', 'world'].
Final Answer:
['hello', '<UNK>', 'world'] -> Option D
Quick Check:
Replace OOV words with <UNK> [OK]
Hint: Replace words not in vocab with <UNK>
Common Mistakes:
- Not replacing 'there' because of misunderstanding
- Replacing all words regardless of vocab
- Confusing list order in output
4. The following code is intended to replace out-of-vocabulary words with <UNK>. What is the error? (medium)
vocab = {'cat', 'dog'}
tokens = ['cat', 'bird', 'dog']
tokens = [word if word not in vocab else '<UNK>' for word in tokens]
print(tokens)
Solution
Step 1: Analyze the condition in list comprehension
The condition word if word not in vocab else '<UNK>' means words NOT in vocab stay as they are, and words IN vocab become '<UNK>'. This is the opposite of the intended behavior.
Step 2: Identify the correct logic
We want to keep words in vocab and replace words not in vocab with '<UNK>'. So the condition should be word if word in vocab else '<UNK>'.
Final Answer:
The condition is reversed; it replaces in-vocab words instead of OOV -> Option B
Quick Check:
Correct condition keeps vocab words, replaces others [OK]
Hint: Check if condition matches intended keep-or-replace logic
Common Mistakes:
- Mixing up 'in' and 'not in' in conditions
- Assuming set vs list affects membership test
- Ignoring Python 3 print syntax
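The reversed condition from question 4 can be checked by running both versions side by side. The variable names `buggy` and `fixed` are introduced here only for illustration.

```python
vocab = {'cat', 'dog'}
tokens = ['cat', 'bird', 'dog']

# Reversed condition: in-vocabulary words are replaced, the OOV word is kept.
buggy = [word if word not in vocab else '<UNK>' for word in tokens]

# Intended condition: in-vocabulary words are kept, the OOV word is replaced.
fixed = [word if word in vocab else '<UNK>' for word in tokens]

print(buggy)  # ['<UNK>', 'bird', '<UNK>']
print(fixed)  # ['cat', '<UNK>', 'dog']
```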
5. You have a pretrained word embedding model that does not include the word 'unicorn'. Which approach best helps your model handle this out-of-vocabulary word during inference? (hard)
Solution
Step 1: Understand limitations of the <UNK> token
Replacing the word with <UNK> loses its specific meaning, which may reduce model accuracy.
Step 2: Consider subword tokenization benefits
Subword tokenization breaks unknown words into smaller known units, allowing the model to infer meaning from parts.
Step 3: Evaluate other options
Ignoring the word loses information, and adding it to the vocabulary without retraining leaves it without a learned embedding. Subword tokenization is the best practical approach.
Final Answer:
Use subword tokenization to break 'unicorn' into known parts -> Option A
Quick Check:
Subword methods handle OOV words better than <UNK> [OK]
Hint: Break unknown words into smaller known pieces with subword tokenization
Common Mistakes:
- Thinking <UNK> always preserves meaning
- Trying to add words without retraining embeddings
- Removing unknown words, which loses important information
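One simple way to picture how meaning can be inferred from parts is to average the vectors of the subword pieces. This is a sketch under strong assumptions: it presumes the embedding model stores vectors for subword pieces, the 2-dimensional values are made up, and `embed_from_pieces` is a hypothetical helper. Many models instead feed the pieces to the network as separate tokens.

```python
# Hypothetical 2-dimensional embeddings for subword pieces (made-up values).
subword_vectors = {
    "uni": [1.0, 0.0],
    "##corn": [0.0, 1.0],
}

def embed_from_pieces(pieces, vectors):
    """Average the vectors of the pieces that have an embedding; None if none do."""
    known = [vectors[p] for p in pieces if p in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

unicorn_vector = embed_from_pieces(["uni", "##corn"], subword_vectors)
print(unicorn_vector)  # [0.5, 0.5]
```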
