
Handling out-of-vocabulary words in NLP - Model Pipeline Trace

Model Pipeline - Handling out-of-vocabulary words

This pipeline shows how a text model handles words it has never seen before, called out-of-vocabulary (OOV) words. It transforms text into numbers, trains a simple model, and predicts while managing unknown words gracefully.

Data Flow - 6 Stages
Stage 1: Raw Text Input
Collect raw sentences, some of which contain unknown words.
In: 5 sentences x variable length → Out: 5 sentences x variable length
"I love apples", "She eats bananas", "They like grapes", "We enjoy mangoes", "He hates durians"
Stage 2: Tokenization
Split each sentence into words.
In: 5 sentences x variable length → Out: 5 sentences x variable-length token lists
[['I', 'love', 'apples'], ['She', 'eats', 'bananas'], ['They', 'like', 'grapes'], ['We', 'enjoy', 'mangoes'], ['He', 'hates', 'durians']]
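The tokenization step above can be sketched with a simple whitespace split, which matches the token lists in the trace (production tokenizers also handle punctuation and casing):

```python
# Split each raw sentence into word tokens on whitespace.
sentences = [
    "I love apples", "She eats bananas", "They like grapes",
    "We enjoy mangoes", "He hates durians",
]
tokenized = [s.split() for s in sentences]
print(tokenized[0])  # ['I', 'love', 'apples']
```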
Stage 3: Vocabulary Building
Create a word-to-index map with a fixed vocabulary size, reserving an index for OOV.
In: all tokens from the training sentences → Out: vocabulary of size 7 (6 known words + 1 OOV token)
{"I":1, "love":2, "apples":3, "She":4, "eats":5, "bananas":6, "<OOV>":0}
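One way to build a capped vocabulary like the one above is a frequency cutoff; this is a sketch under the assumption that ties are broken by first appearance, which reproduces the trace's six known words (all counts here are 1, so the words of the first two sentences win):

```python
from collections import Counter

tokenized = [
    ['I', 'love', 'apples'], ['She', 'eats', 'bananas'],
    ['They', 'like', 'grapes'], ['We', 'enjoy', 'mangoes'],
    ['He', 'hates', 'durians'],
]

VOCAB_SIZE = 7  # 6 known words + 1 reserved OOV slot
counts = Counter(w for sent in tokenized for w in sent)

# Reserve index 0 for the OOV token, then assign indices 1..6 to the
# most frequent words (Counter.most_common is stable for ties).
vocab = {"<OOV>": 0}
for word, _ in counts.most_common(VOCAB_SIZE - 1):
    vocab[word] = len(vocab)
```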
Stage 4: Text-to-Sequence Conversion
Replace each word with its vocabulary index; unknown words get the OOV index 0.
In: 5 sentences x tokens → Out: 5 sentences x token indices
[[1, 2, 3], [4, 5, 6], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
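The conversion is a dictionary lookup with the OOV index as the fallback; a minimal sketch reproducing the output above:

```python
vocab = {"<OOV>": 0, "I": 1, "love": 2, "apples": 3,
         "She": 4, "eats": 5, "bananas": 6}
tokenized = [
    ['I', 'love', 'apples'], ['She', 'eats', 'bananas'],
    ['They', 'like', 'grapes'], ['We', 'enjoy', 'mangoes'],
    ['He', 'hates', 'durians'],
]

OOV = vocab["<OOV>"]
# dict.get falls back to the OOV index for any word not in the vocabulary.
sequences = [[vocab.get(w, OOV) for w in sent] for sent in tokenized]
```

Because "They like grapes", "We enjoy mangoes", and "He hates durians" contain only out-of-vocabulary words, all three collapse to [0, 0, 0].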
Stage 5: Padding Sequences
Pad sequences with zeros to the maximum length of 3 (here every sequence is already length 3, so the output is unchanged).
In: 5 sentences x variable length → Out: 5 sentences x 3 tokens
[[1, 2, 3], [4, 5, 6], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
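A minimal padding helper, sketched here with right-padding and truncation to the fixed length:

```python
MAX_LEN = 3

def pad(seq, max_len=MAX_LEN, pad_value=0):
    # Right-pad short sequences with zeros and truncate long ones.
    return (seq + [pad_value] * max_len)[:max_len]

sequences = [[1, 2, 3], [4, 5, 6], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
padded = [pad(s) for s in sequences]
```

Note that this trace reuses index 0 for both padding and the OOV token, so the model cannot tell them apart; frameworks such as Keras conventionally reserve 0 for padding and assign the OOV token index 1 to keep the two distinguishable.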
Stage 6: Model Training
Train a simple neural network to classify the sentences, with OOV words handled as index 0.
In: 5 samples x 3 tokens → Out: updated model weights
Training on sequences with OOV tokens handled as index 0
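To make the training step concrete, here is a sketch that substitutes a logistic-regression classifier (stdlib only) for the trace's neural network; the labels `y` are hypothetical (1 = fully in-vocabulary sentence), invented for this sketch since the trace does not specify a classification target:

```python
import math

# Padded sequences from Stage 5; hypothetical binary labels.
X = [[1, 2, 3], [4, 5, 6], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
y = [1, 1, 0, 0, 0]
VOCAB_SIZE = 7

def bag_of_words(seq):
    # Count occurrences of each vocabulary index (including OOV slot 0).
    vec = [0.0] * VOCAB_SIZE
    for idx in seq:
        vec[idx] += 1.0
    return vec

features = [bag_of_words(s) for s in X]
w, b, lr = [0.0] * VOCAB_SIZE, 0.0, 0.5

def predict_proba(x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

losses = []
for epoch in range(3):  # mirrors the 3-epoch trace below
    total = 0.0
    for x, t in zip(features, y):
        p = predict_proba(x)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        g = p - t  # gradient of the log-loss w.r.t. the logit
        for i in range(VOCAB_SIZE):
            w[i] -= lr * g * x[i]
        b -= lr * g
    losses.append(total / len(y))
```

Because every OOV word maps to the same index 0, the three all-unknown sentences yield identical feature vectors, yet training still proceeds without errors, which is the point of the OOV mapping.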
Training Trace - Epoch by Epoch

Epochs
1 |***************         | Loss 0.85
2 |********************    | Loss 0.65
3 |************************| Loss 0.45
Epoch | Loss ↓ | Accuracy ↑ | Observation
1     | 0.85   | 0.40       | Model starts learning; loss is high, accuracy low
2     | 0.65   | 0.60       | Loss decreases and accuracy improves as the model learns
3     | 0.45   | 0.80       | Model converges with good accuracy on the training data
Prediction Trace - 4 Layers
Layer 1: Input Sentence
Layer 2: Text to Sequence
Layer 3: Padding
Layer 4: Model Prediction
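Layers 1-3 of the prediction trace reuse the training-time preprocessing; a sketch with a hypothetical new sentence containing a word ("durians") that is not in the vocabulary:

```python
vocab = {"<OOV>": 0, "I": 1, "love": 2, "apples": 3,
         "She": 4, "eats": 5, "bananas": 6}
MAX_LEN = 3

def preprocess(sentence):
    # Layers 1-2: tokenize and map unknown words to the OOV index.
    tokens = sentence.split()
    seq = [vocab.get(t, vocab["<OOV>"]) for t in tokens]
    # Layer 3: pad/truncate to the fixed model input length.
    return (seq + [0] * MAX_LEN)[:MAX_LEN]

print(preprocess("I love durians"))  # [1, 2, 0] — 'durians' maps to OOV
# Layer 4 would feed this vector to the trained model, e.g.
# model.predict([preprocess("I love durians")])  (hypothetical model object)
```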
Model Quiz - 3 Questions
Test your understanding
What happens to words not in the vocabulary during text to sequence conversion?
A. They are assigned random indices
B. They are replaced with a special OOV token index
C. They are removed from the sentence
D. They cause an error and stop processing
Key Insight
Handling out-of-vocabulary words by mapping them to a special token index allows the model to process unknown words without errors. This approach keeps the input consistent and helps the model generalize better to new text.