NLP · ~15 mins

Handling out-of-vocabulary words in NLP - Deep Dive

Overview - Handling out-of-vocabulary words
What is it?
Handling out-of-vocabulary (OOV) words means dealing with words that a language model or system has never seen before during training. These words can cause problems because the model doesn't know their meaning or how to process them. Techniques to handle OOV words help models understand or guess the meaning of new words so they can still work well. This is important for making language tools flexible and useful in real life.
Why it matters
Without handling OOV words, language models would fail or give wrong answers whenever they meet new words, which happens often because language is always changing. For example, new slang, names, or technical terms appear all the time. If models ignore or mishandle these, users get poor results, making tools like translators, chatbots, or search engines less helpful. Handling OOV words keeps language AI useful and accurate in the real world.
Where it fits
Before learning about handling OOV words, you should understand basic natural language processing concepts like tokenization and word embeddings. After this, you can explore advanced topics like subword models, contextual embeddings, and transfer learning that further improve how models deal with language variability.
Mental Model
Core Idea
Handling out-of-vocabulary words means finding smart ways to understand or represent words a model has never seen before so it can still make good predictions.
Think of it like...
It's like meeting a new person with a name you've never heard before; instead of ignoring them, you try to guess their personality from their name's parts or context.
┌──────────────────────────────┐
│        Input Sentence        │
│ "I love my new quizzlet app" │
└──────────────┬───────────────┘
               │
       ┌───────▼────────┐
       │ Tokenization   │
       │ [I, love, my,  │
       │ new, quizzlet, │
       │ app]           │
       └───────┬────────┘
               │
       ┌───────▼─────────────┐
       │ Check Vocabulary    │
       │ quizzlet → OOV word │
       └───────┬─────────────┘
               │
   ┌───────────▼────────────┐
   │ Handle OOV Word        │
   │ (e.g., subword split,  │
   │  embedding fallback)   │
   └───────────┬────────────┘
               │
       ┌───────▼────────┐
       │ Model Output   │
       │ Prediction or  │
       │ Understanding  │
       └────────────────┘
Build-Up - 7 Steps
1
Foundation: What are out-of-vocabulary words?
🤔
Concept: Introduce the idea of words not seen during model training.
When a language model learns, it builds a list of words it knows, called a vocabulary. Any word not in this list is called an out-of-vocabulary (OOV) word. For example, if the model never saw the word 'quizzlet' during training, it won't know what it means or how to handle it.
Result
You understand that OOV words are unknown words that can confuse language models.
Knowing what OOV words are helps you see why language models sometimes fail or give wrong answers.
2
Foundation: Why OOV words cause problems
🤔
Concept: Explain the impact of OOV words on language model performance.
Language models rely on their vocabulary to convert words into numbers they can understand. If a word is OOV, the model has no number for it, so it might ignore it or treat it as a generic unknown token. This can make the model misunderstand sentences or lose important meaning.
Result
You see that OOV words reduce model accuracy and understanding.
Understanding the problem OOV words cause motivates learning how to handle them.
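The lookup step described above can be sketched in a few lines of Python. The vocabulary and its integer ids below are toy assumptions for illustration, not a real model's vocabulary:

```python
# Toy vocabulary: maps each known word to an integer id.
vocabulary = {"i": 0, "love": 1, "my": 2, "new": 3, "app": 4}

def to_ids(sentence):
    """Convert words to ids; OOV words get None because no id exists."""
    return [vocabulary.get(word) for word in sentence.lower().split()]

print(to_ids("I love my new quizzlet app"))
# [0, 1, 2, 3, None, 4] -- 'quizzlet' has no id the model can use
```

The `None` is exactly the gap the rest of this lesson is about: the model has no number for the unknown word, so something else must fill in.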
3
Intermediate: Simple OOV handling with an unknown token
🤔 Before reading on: do you think replacing unknown words with a single token keeps all sentence meaning intact? Commit to yes or no.
Concept: Introduce the basic method of replacing OOV words with a special unknown token.
One simple way to handle OOV words is to replace them with a special token such as <UNK>. The model then treats all unknown words the same way. For example, 'quizzlet' becomes <UNK>. This lets the model continue working but loses the unique meaning of the unknown word.
Result
The model can process sentences with OOV words but may lose specific meaning.
Knowing this method shows the tradeoff between simplicity and losing word uniqueness.
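A minimal sketch of the replacement method, assuming a toy set-based vocabulary:

```python
# Toy vocabulary of known words (an assumption for illustration).
vocabulary = {"i", "love", "my", "new", "app"}
UNK = "<UNK>"

def replace_oov(sentence):
    """Keep known words; map every unknown word to the shared <UNK> token."""
    return [w if w in vocabulary else UNK for w in sentence.lower().split()]

print(replace_oov("I love my new quizzlet app"))
# ['i', 'love', 'my', 'new', '<UNK>', 'app']
```

Note that 'quizzlet' and any other unknown word collapse to the same token, which is exactly the tradeoff this step describes.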
4
Intermediate: Subword tokenization for OOV words
🤔 Before reading on: do you think breaking words into smaller parts helps the model guess new words' meanings? Commit to yes or no.
Concept: Explain how splitting words into smaller pieces helps handle OOV words better.
Instead of treating unknown words as one piece, subword tokenization breaks them into smaller known parts. For example, 'quizzlet' might split into 'quiz' + 'zlet'. The model knows 'quiz' and can guess the meaning better. Popular methods include Byte Pair Encoding (BPE) and WordPiece.
Result
Models can understand new words by combining known subword parts.
Understanding subword tokenization reveals how models handle language creativity and new words.
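A greedy longest-match split, in the spirit of WordPiece, can be sketched as follows. The subword inventory is a toy assumption chosen so that 'quizzlet' splits as in the example above; real BPE or WordPiece vocabularies are learned from training data:

```python
# Toy subword inventory (an assumption; real inventories are learned).
subwords = {"quiz", "zlet", "app", "new", "z", "let"}

def subword_split(word):
    """Greedily take the longest known subword at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<UNK>")  # no subword covers this character
            i += 1
    return pieces

print(subword_split("quizzlet"))
# ['quiz', 'zlet'] -- the OOV word is rebuilt from known pieces
```

Because every character can fall back to single-character subwords in a real inventory, this scheme rarely produces a true <UNK>.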
5
IntermediateCharacter-level embeddings for OOV words
🤔
Concept: Introduce representing words by their characters to handle unknown words.
Another way is to look at the characters inside a word. Models can create embeddings from characters, so even if the whole word is unknown, the model uses its letters to guess meaning. For example, 'quizzlet' shares characters with 'quiz', helping the model understand it better.
Result
Models become more flexible and can handle any word by analyzing characters.
Knowing character-level embeddings shows how models can generalize beyond fixed vocabularies.
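A character-level sketch in the spirit of fastText: build a word vector by hashing character trigrams into a random toy table and averaging their vectors. The table values and sizes here are made-up assumptions:

```python
import zlib

import numpy as np

DIM, BUCKETS = 8, 1000
rng = np.random.default_rng(0)
ngram_table = rng.standard_normal((BUCKETS, DIM))  # one vector per hashed trigram

def char_ngrams(word, n=3):
    """Character trigrams of the word with boundary markers, e.g. '<qu'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def embed(word):
    """Average the hashed trigram vectors to represent any word, even OOV."""
    vecs = [ngram_table[zlib.crc32(g.encode()) % BUCKETS] for g in char_ngrams(word)]
    return np.mean(vecs, axis=0)

# 'quizzlet' and 'quiz' share trigrams, so their embeddings share components
# even though 'quizzlet' never appeared in training.
shared = set(char_ngrams("quizzlet")) & set(char_ngrams("quiz"))
print(sorted(shared))  # ['<qu', 'qui', 'uiz']
```

The shared trigrams are what let the model generalize from 'quiz' to the unseen 'quizzlet'.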
6
Advanced: Contextual embeddings reduce OOV impact
🤔 Before reading on: do you think context helps models understand unknown words better? Commit to yes or no.
Concept: Explain how models like BERT use context to understand words, even if they are rare or unknown.
Modern models use context to create word meanings on the fly. Even if a word is rare or new, the model looks at surrounding words to guess its meaning. This reduces the problem of OOV words because the model doesn't rely only on fixed vocabularies but on sentence context.
Result
Models handle OOV words better by understanding their use in sentences.
Understanding contextual embeddings shows a powerful way to overcome vocabulary limits.
7
Expert: Tradeoffs and surprises in OOV handling
🤔 Before reading on: do you think more complex OOV methods always improve model performance? Commit to yes or no.
Concept: Discuss the limits and unexpected effects of OOV handling methods in real systems.
While advanced methods like subword tokenization and contextual embeddings help, they add complexity and can introduce errors. For example, splitting words incorrectly can change meaning, or context may mislead the model. Also, very rare words might still be misunderstood. Balancing vocabulary size, model size, and OOV handling is key in production.
Result
You appreciate the nuanced tradeoffs in designing OOV handling strategies.
Knowing these tradeoffs prepares you to make informed choices in real-world NLP projects.
Under the Hood
At the core, language models convert words into numbers called embeddings. When a word is OOV, the model cannot find its embedding in the vocabulary. Subword tokenization breaks the word into smaller known units, each with embeddings, which are combined to represent the whole word. Character-level embeddings build word representations from individual letters using neural networks. Contextual models generate embeddings dynamically based on surrounding words, allowing flexible understanding of new words.
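The "combine subword embeddings" step above can be sketched by mean-pooling the pieces' vectors. The 2-dimensional embeddings below are made-up toy values, not learned parameters:

```python
import numpy as np

# Toy embedding table for known subwords (values are assumptions).
emb = {
    "quiz": np.array([1.0, 0.0]),
    "zlet": np.array([0.0, 1.0]),
}

def word_vector(pieces):
    """Represent an OOV word as the mean of its subword embeddings."""
    return np.mean([emb[p] for p in pieces], axis=0)

print(word_vector(["quiz", "zlet"]))  # [0.5 0.5]
```

Real systems may also sum the vectors or let attention layers weigh the pieces; mean-pooling is just the simplest combination.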
Why designed this way?
Early models used fixed vocabularies for simplicity and speed but struggled with OOV words. Subword and character methods emerged to balance vocabulary size and coverage, allowing models to handle new words without exploding vocabulary size. Contextual embeddings were designed to capture meaning dynamically, reducing reliance on fixed vocabularies and improving understanding of rare or new words.
┌───────────────┐
│ Input Word    │
│ (e.g., OOV)   │
└───────┬───────┘
        │
┌───────▼───────────┐
│ Vocabulary Lookup │
│ Found?            │
└───────┬───────┬───┘
        │       │
       Yes      No
        │       │
┌───────▼───┐ ┌─▼────────────────┐
│ Use Word  │ │ Subword Tokenize │
│ Embedding │ │ or Char Embed    │
└───────┬───┘ └─┬────────────────┘
        │       │
        └───┬───┘
            │
     ┌──────▼────────┐
     │ Combine Parts │
     │ Embeddings    │
     └──────┬────────┘
            │
     ┌──────▼────────┐
     │ Contextual    │
     │ Embedding     │
     └──────┬────────┘
            │
     ┌──────▼────────┐
     │ Model Output  │
     └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does replacing all unknown words with <UNK> keep sentence meaning fully intact? Commit to yes or no.
Common Belief: Replacing unknown words with a single <UNK> token is enough to handle all OOV problems.
Reality: Using <UNK> loses the unique meaning of each unknown word, which can confuse the model and reduce accuracy.
Why it matters: Ignoring word uniqueness causes models to misunderstand sentences, especially when unknown words carry important information.
Quick: Do subword tokenization methods always split words perfectly? Commit to yes or no.
Common Belief: Subword tokenization always splits words correctly and improves understanding.
Reality: Subword methods sometimes split words in unnatural ways, which can distort meaning or confuse the model.
Why it matters: Incorrect splits can lead to wrong predictions or loss of nuance in language understanding.
Quick: Does having a huge vocabulary completely solve OOV problems? Commit to yes or no.
Common Belief: If the vocabulary is large enough, there will be no OOV words.
Reality: Even very large vocabularies cannot cover all possible words, especially new or rare ones, and they increase model size and complexity.
Why it matters: Relying on huge vocabularies is inefficient and impractical for real-world language variability.
Quick: Do contextual embeddings always perfectly understand new words? Commit to yes or no.
Common Belief: Contextual embeddings always solve OOV issues perfectly by using sentence context.
Reality: Context helps, but it can mislead models if it is ambiguous or if rare words appear in unusual ways.
Why it matters: Overreliance on context can cause errors in understanding, especially in noisy or creative language.
Expert Zone
1
Subword tokenization algorithms like BPE balance vocabulary size and coverage but require careful tuning to avoid over-splitting or under-splitting.
2
Character-level embeddings add robustness but increase computational cost and may struggle with very long or complex words.
3
Contextual embeddings reduce OOV impact but depend heavily on training data quality and can still fail on truly novel or domain-specific terms.
When NOT to use
Handling OOV words with subword or character methods may not be ideal for languages with complex morphology or when exact word forms are critical, such as legal or medical texts. In such cases, domain-specific vocabularies or hybrid approaches combining dictionaries and embeddings are better.
Production Patterns
In production, models often combine subword tokenization with contextual embeddings to balance flexibility and accuracy. Systems may also update vocabularies periodically or use fallback dictionaries for domain-specific terms. Monitoring OOV rates helps maintain model performance over time.
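Monitoring can be as simple as tracking the fraction of incoming tokens that miss the vocabulary. This sketch assumes a set-based vocabulary; a production system would compute the same ratio over a rolling window of traffic:

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens not found in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)

# Toy vocabulary and token stream (assumptions for illustration).
vocab = {"i", "love", "my", "new", "app"}
stream = "i love my new quizzlet app".split()
print(f"OOV rate: {oov_rate(stream, vocab):.1%}")  # OOV rate: 16.7%
```

A rising OOV rate over time is a signal that the vocabulary or tokenizer needs retraining for the current domain.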
Connections
Data Compression
Subword tokenization uses similar principles to data compression by breaking data into frequent parts.
Understanding how data compression finds common patterns helps grasp why subword tokenization efficiently represents language.
Human Language Learning
Humans learn new words by breaking them into familiar parts and using context, similar to OOV handling in models.
Knowing how people guess meanings of new words from roots and context deepens understanding of model strategies.
Error Correction in Communication
Handling OOV words is like correcting errors or unknown signals in communication systems to maintain message meaning.
Seeing OOV handling as error correction highlights the importance of robustness and fallback methods in language AI.
Common Pitfalls
#1 Ignoring OOV words by removing them from input.
Wrong approach:
sentence = 'I love my new quizzlet app'
tokens = [w for w in sentence.split() if w in vocabulary]  # 'quizzlet' removed silently
Correct approach:
sentence = 'I love my new quizzlet app'
tokens = [w if w in vocabulary else '<UNK>' for w in sentence.split()]  # OOV mapped to '<UNK>'
Root cause: Not realizing that silently removing unknown words loses important information and harms model understanding.
#2 Using a fixed large vocabulary without subword methods.
Wrong approach:
vocabulary = load_large_vocab()  # model fails on truly new words not in this huge list
Correct approach:
vocabulary = load_subword_vocab()  # model can handle new words by combining subwords
Root cause: Believing a bigger vocabulary alone solves OOV, ignoring efficiency and coverage tradeoffs.
#3 Splitting words incorrectly with naive tokenization.
Wrong approach:
tokens = sentence.split(' ')  # 'quizzlet' treated as one unknown token without subword split
Correct approach:
tokens = subword_tokenizer.tokenize(sentence)  # 'quizzlet' split into known subwords
Root cause: Not applying advanced tokenization methods that improve OOV handling.
Key Takeaways
Out-of-vocabulary words are unknown words that language models have not seen during training and can cause errors.
Simple replacement with an unknown token is easy but loses the unique meaning of new words.
Subword and character-level methods break words into smaller parts to better represent and understand new words.
Contextual embeddings use sentence context to dynamically create word meanings, reducing OOV problems.
Handling OOV words well requires balancing vocabulary size, model complexity, and real-world language variability.