Tokenization splits text into words or smaller pieces called tokens. The key metric is Tokenization Accuracy: the share of tokens the tokenizer produces that match a trusted (gold) standard. High accuracy means the text is split the way the standard expects, which helps later steps such as understanding meaning or finding keywords.
Tokenization in spaCy in NLP - Model Metrics & Evaluation
| | Predicted Token | Predicted No Token |
|-----------------|-----------------|--------------------|
| Actual Token | True Token (TP) | Missed Token (FN) |
| Actual No Token | False Token (FP) | True No Token (TN) |
TP: Correctly identified tokens
FP: Incorrectly added tokens
FN: Tokens missed by tokenizer
TN: Correctly identified non-token boundaries
Example: If the tokenizer splits "don't" into "do" and "n't" as the standard expects, each matching token counts as a TP. If it leaves the contraction unsplit, the tokens it failed to produce count as FN.
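The counts above can be sketched in plain Python by comparing character spans of predicted and gold tokens. This is a minimal illustration, not spaCy's own scorer: the helper names `token_spans` and `count_matches` are hypothetical, and the sketch assumes every token appears verbatim in the text. It counts at the token level, so TN is not computed, and an unsplit contraction shows up as one extra token (FP) plus two missed tokens (FN).

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character offsets in text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if a token is not in the text
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def count_matches(text, predicted, gold):
    """Return (TP, FP, FN) by comparing predicted and gold token spans."""
    pred = set(token_spans(text, predicted))
    ref = set(token_spans(text, gold))
    tp = len(pred & ref)  # spans present in both
    fp = len(pred - ref)  # spans only the tokenizer produced
    fn = len(ref - pred)  # gold spans the tokenizer missed
    return tp, fp, fn


text = "I don't know."
gold = ["I", "do", "n't", "know", "."]
predicted = ["I", "don't", "know", "."]  # contraction left unsplit

print(count_matches(text, predicted, gold))  # (3, 1, 2)
```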
Precision is the fraction of tokens the tokenizer predicted that are actually correct: TP / (TP + FP). High precision means fewer wrong splits.
Recall is the fraction of true tokens the tokenizer found out of all real tokens: TP / (TP + FN). High recall means fewer missed splits.
Example: If the tokenizer splits too much, precision drops (more wrong tokens). If it splits too little, recall drops (it misses tokens).
Good tokenization balances precision and recall to avoid both missing and adding wrong tokens.
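As a rough sketch, precision and recall can be computed directly from the TP, FP, and FN counts. The function name `precision_recall_f1` and the counts below are made up for illustration. The F1 score (the harmonic mean of the two) is not part of the text above; it is added here as one common way to express the balance in a single number.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct tokens, 10 wrong tokens, 30 missed tokens.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.90 recall=0.75 f1=0.82
```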
- Good: Precision and Recall above 95%. Tokenization matches human standard closely.
- Bad: Precision or Recall below 80%. Many tokens are wrong or missed, causing errors in later text analysis.
- Example: Precision 98%, Recall 97% means the tokenizer is very reliable.
- Example: Precision 70%, Recall 60% means the tokenizer often splits wrongly or misses tokens.
- Ignoring context: Some tokens depend on language rules or abbreviations. Simple metrics may miss these nuances.
- Data leakage: Testing the tokenizer on data it was trained or tuned on can give overly optimistic accuracy.
- Overfitting: Tokenizer tuned too much on one text type may fail on others.
- Accuracy paradox: High overall accuracy can hide poor token splits if many tokens are easy.
Your tokenizer has 98% accuracy but 12% recall on splitting contractions like "don't". Is it good?
Answer: No. Even with high overall accuracy, very low recall on contractions means many tokens are missed. This hurts understanding and downstream tasks. You should improve recall on these cases.
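One way to surface this kind of gap is to report recall per token category instead of a single overall figure. The sketch below uses invented data chosen to land near the numbers in the question, and the function name `recall_by_category` is hypothetical. Each record is one gold token, tagged with a category and whether the tokenizer produced it.

```python
from collections import defaultdict


def recall_by_category(records):
    """records: iterable of (category, found) pairs, one per gold token."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, found in records:
        totals[category] += 1
        hits[category] += int(found)
    return {category: hits[category] / totals[category] for category in totals}


# Invented data: 1000 gold tokens, of which 25 come from contractions.
records = (
    [("plain", True)] * 975
    + [("contraction", True)] * 3
    + [("contraction", False)] * 22
)

overall = sum(found for _, found in records) / len(records)
per_category = recall_by_category(records)

print(f"overall: {overall:.1%}")
for category, value in per_category.items():
    print(f"{category}: {value:.1%}")
```

Because contractions make up a small share of all tokens, the overall figure stays near 98% even though only 3 of the 25 contraction tokens are found.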
Practice
Solution
Step 1: Understand tokenization concept
Tokenization means breaking text into smaller parts called tokens, like words or punctuation.
Step 2: Relate to spaCy functionality
spaCy uses tokenization to prepare text for analysis by splitting it into tokens.
Final Answer:
It splits text into smaller pieces called tokens. -> Option A
Quick Check:
Tokenization = splitting text [OK]
- Confusing tokenization with training models
- Thinking tokenization translates text
- Assuming tokenization visualizes data
Solution
Step 1: Recall spaCy model loading syntax
spaCy loads models using spacy.load with the model name as a string.
Step 2: Identify correct model name and function
The English small model is 'en_core_web_sm' and loaded by spacy.load('en_core_web_sm').
Final Answer:
import spacy; nlp = spacy.load('en_core_web_sm') -> Option B
Quick Check:
Use spacy.load('model_name') to load models [OK]
- Using spacy.tokenize instead of spacy.load
- Wrong model names like 'english' instead of 'en_core_web_sm'
- Using non-existent functions like load_model
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello, world!')
tokens = [token.text for token in doc]
print(tokens)
Solution
Step 1: Understand spaCy tokenization behavior
spaCy splits punctuation from words, so commas and exclamation marks become separate tokens.
Step 2: Analyze the given text 'Hello, world!'
Tokens will be 'Hello', ',', 'world', and '!' separately.
Final Answer:
['Hello', ',', 'world', '!'] -> Option A
Quick Check:
spaCy separates punctuation as tokens [OK]
- Keeping punctuation attached to words
- Combining words into one token
- Ignoring punctuation tokens
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Test sentence.')
for token in doc:
print(token.text)
Solution
Step 1: Check Python syntax for loops
Python requires the code inside a for loop to be indented properly.
Step 2: Inspect the given code
The print statement is not indented under the for loop, causing an IndentationError.
Final Answer:
Missing indentation for print inside the for loop. -> Option C
Quick Check:
Indent loop body code in Python [OK]
- Ignoring Python indentation rules
- Assuming model name is wrong
- Thinking token.text is invalid
Solution
Step 1: Understand spaCy's default tokenizer behavior
By default, spaCy splits contractions like "don't" into two tokens: "do" and "n't".
Step 2: Identify how to keep contractions as one token
Modifying tokenizer exceptions allows spaCy to treat contractions as single tokens.
Final Answer:
Modify the tokenizer exceptions to keep contractions as single tokens. -> Option D
Quick Check:
Customize tokenizer exceptions to control token splits [OK]
- Using default tokenizer expecting contractions as one token
- Splitting contractions manually after tokenization
- Replacing contractions before tokenization unnecessarily
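A minimal sketch of the approach in the solution above uses spaCy's `Tokenizer.add_special_case` to override the built-in exception for one contraction. It assumes spaCy v3 is installed and uses `spacy.blank("en")`, which carries the default English tokenizer rules without needing a downloaded model. The two pipeline variable names are only illustrative. Special cases match exact strings, so variants such as "Don't" or "can't" would each need their own entry.

```python
import spacy
from spacy.symbols import ORTH

# Two blank English pipelines: one left at the defaults, one customized.
default_nlp = spacy.blank("en")
custom_nlp = spacy.blank("en")

# Override the built-in rule so this exact string stays a single token.
custom_nlp.tokenizer.add_special_case("don't", [{ORTH: "don't"}])

sentence = "I don't know"
default_tokens = [token.text for token in default_nlp(sentence)]
custom_tokens = [token.text for token in custom_nlp(sentence)]

print(default_tokens)  # contraction split by the default rules
print(custom_tokens)   # contraction kept whole by the special case
```

Keeping contractions whole changes what downstream components see, so it is worth checking that any trained model used afterwards was trained with the same tokenization.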
