
Tokenization in spaCy in NLP - Model Metrics & Evaluation

Metrics & Evaluation - Tokenization in spaCy
Which metric matters for tokenization in spaCy, and why

Tokenization splits text into words, punctuation marks, and other pieces. The key metric is tokenization accuracy: the share of tokens produced by the tokenizer that match a trusted (gold) standard. High accuracy means the text is split the way the standard expects, which helps later steps such as understanding meaning or finding keywords.

Confusion matrix for Tokenization
      |                 | Predicted Token  | Predicted No Token |
      |-----------------|------------------|--------------------|
      | Actual Token    | True Token (TP)  | Missed Token (FN)  |
      | Actual No Token | False Token (FP) | True No Token (TN) |

      TP: Correctly identified tokens
      FP: Incorrectly added tokens
      FN: Tokens missed by tokenizer
      TN: Correctly identified non-token boundaries
    

Example: If the gold standard splits "don't" into "do" and "n't" and the tokenizer does the same, the split counts as a TP. If the tokenizer leaves "don't" unsplit, the missed split is an FN.
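
To make the four cells concrete, here is a minimal sketch that counts them at the level of character offsets. It assumes one possible convention (every offset where a token starts or ends is a "boundary"); the helper name `boundaries` is hypothetical and not part of spaCy.

```python
def boundaries(tokens, text):
    """Return the character offsets where a token starts or ends."""
    offsets = set()
    cursor = 0
    for tok in tokens:
        start = text.index(tok, cursor)  # locate the token in the raw text
        end = start + len(tok)
        offsets.update((start, end))
        cursor = end
    return offsets


text = "don't"
gold = boundaries(["do", "n't"], text)      # gold standard splits the contraction
pred = boundaries(["don't"], text)          # tokenizer leaves it whole
candidates = set(range(len(text) + 1))      # every possible offset in the text

tp = len(gold & pred)                # boundaries found by both
fp = len(pred - gold)                # boundaries the tokenizer invented
fn = len(gold - pred)                # boundaries the tokenizer missed
tn = len(candidates - gold - pred)   # offsets correctly left unsplit

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

The single missed split between "do" and "n't" shows up as the one FN, while the untouched offsets inside the word are the TNs.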

Tradeoff: Precision vs Recall in Tokenization

Precision is the share of tokens produced by the tokenizer that are actually correct. High precision means few wrong splits.

Recall is the share of true tokens that the tokenizer found. High recall means few missed splits.

Example: If the tokenizer splits too much, precision drops (more wrong tokens). If it splits too little, recall drops (true tokens are missed).

Good tokenization balances precision and recall to avoid both missing and adding wrong tokens.
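
The tradeoff can be sketched in plain Python by comparing token spans (start and end character offsets) against a gold standard. This is an illustrative sketch, not spaCy's own scorer; the function names `token_spans` and `precision_recall_f1` and the sample sentence are assumptions made for the example.

```python
def token_spans(tokens, text):
    """Map a token list to a set of (start, end) character spans."""
    spans = set()
    cursor = 0
    for tok in tokens:
        start = text.index(tok, cursor)
        end = start + len(tok)
        spans.add((start, end))
        cursor = end
    return spans


def precision_recall_f1(pred_tokens, gold_tokens, text):
    """A predicted token is correct only if its span matches a gold span exactly."""
    pred = token_spans(pred_tokens, text)
    gold = token_spans(gold_tokens, text)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


text = "I don't know."
gold_tokens = ["I", "do", "n't", "know", "."]

# Splits too little: the contraction is left whole.
under = precision_recall_f1(["I", "don't", "know", "."], gold_tokens, text)
# Splits too much: "n't" is broken into "n" and "'t".
over = precision_recall_f1(["I", "do", "n", "'t", "know", "."], gold_tokens, text)

print("under-split  P=%.2f R=%.2f F1=%.2f" % under)
print("over-split   P=%.2f R=%.2f F1=%.2f" % over)
```

With exact span matching, a single bad split lowers both numbers, but under-splitting hurts recall more and over-splitting hurts precision more, which matches the tradeoff described above.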

Good vs Bad Metric Values for Tokenization
  • Good: Precision and recall both above 95%. Tokenization closely matches the human-annotated standard.
  • Bad: Precision or recall below 80%. Many tokens are wrong or missed, causing errors in later text analysis.
  • Example: Precision 98% and recall 97% mean the tokenizer is very reliable.
  • Example: Precision 70% and recall 60% mean the tokenizer often splits wrongly or misses tokens.
Common Pitfalls in Tokenization Metrics
  • Ignoring context: Some tokens depend on language rules or abbreviations. Simple metrics may miss these nuances.
  • Data leakage: Evaluating a tokenizer on the same data used to train it or tune its rules gives overly optimistic accuracy.
  • Overfitting: A tokenizer tuned too much on one text type may fail on others.
  • Accuracy paradox: High overall accuracy can hide poor performance on hard cases, such as contractions, when most tokens are easy to split.
Self Check

Your tokenizer has 98% accuracy but 12% recall on splitting contractions like "don't". Is it good?

Answer: No. Even with high overall accuracy, very low recall on contractions means many tokens are missed. This hurts understanding and downstream tasks. You should improve recall on these cases.
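
A small sketch shows how both numbers can be true at once. The counts below are hypothetical, chosen only to reproduce the 98% and 12% figures, and "accuracy" is taken here to mean the share of gold tokens the tokenizer recovered.

```python
# Hypothetical counts: gold tokens per slice and how many the tokenizer found.
slices = {
    "contractions": {"gold": 100, "found": 12},
    "everything_else": {"gold": 4900, "found": 4888},
}


def recall(found, gold):
    """Share of gold tokens that were recovered."""
    return found / gold if gold else 0.0


overall = recall(
    sum(s["found"] for s in slices.values()),
    sum(s["gold"] for s in slices.values()),
)
per_slice = {name: recall(s["found"], s["gold"]) for name, s in slices.items()}

print(f"overall: {overall:.0%}")
for name, value in per_slice.items():
    print(f"{name}: {value:.0%}")
```

Because contractions make up only a small part of the tokens in this made-up sample, the overall figure stays high even though almost all of them are handled wrongly. Reporting the metric per slice exposes the problem.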

Key Result
Tokenization accuracy depends on balancing precision and recall to correctly split text into tokens without missing or adding wrong pieces.

Practice

1. What does tokenization do in spaCy?
easy
A. It splits text into smaller pieces called tokens.
B. It trains a machine learning model.
C. It translates text into another language.
D. It visualizes text data.

Solution

  1. Step 1: Understand tokenization concept

    Tokenization means breaking text into smaller parts called tokens, like words or punctuation.
  2. Step 2: Relate to spaCy functionality

    spaCy uses tokenization to prepare text for analysis by splitting it into tokens.
  3. Final Answer:

    It splits text into smaller pieces called tokens. -> Option A
  4. Quick Check:

    Tokenization = splitting text [OK]
Hint: Tokenization means breaking text into tokens [OK]
Common Mistakes:
  • Confusing tokenization with training models
  • Thinking tokenization translates text
  • Assuming tokenization visualizes data
2. Which of the following is the correct way to load the English model in spaCy for tokenization?
easy
A. import spacy; nlp = spacy.tokenize('en')
B. import spacy; nlp = spacy.load('en_core_web_sm')
C. import spacy; nlp = spacy.model('english')
D. import spacy; nlp = spacy.load_model('english')

Solution

  1. Step 1: Recall spaCy model loading syntax

    spaCy loads models using spacy.load with the model name as a string.
  2. Step 2: Identify correct model name and function

    The small English model is named 'en_core_web_sm' and is loaded with spacy.load('en_core_web_sm').
  3. Final Answer:

    import spacy; nlp = spacy.load('en_core_web_sm') -> Option B
  4. Quick Check:

    Use spacy.load('model_name') to load models [OK]
Hint: Use spacy.load('model_name') to load models [OK]
Common Mistakes:
  • Using spacy.tokenize instead of spacy.load
  • Wrong model names like 'english' instead of 'en_core_web_sm'
  • Using non-existent functions like load_model
3. What will be the output tokens list from this code snippet?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello, world!')
tokens = [token.text for token in doc]
print(tokens)
medium
A. ['Hello', ',', 'world', '!']
B. ['Hello,', 'world!']
C. ['Hello world']
D. ['Hello', 'world!']

Solution

  1. Step 1: Understand spaCy tokenization behavior

    spaCy splits punctuation from words, so commas and exclamation marks become separate tokens.
  2. Step 2: Analyze the given text 'Hello, world!'

    Tokens will be 'Hello', ',', 'world', and '!' separately.
  3. Final Answer:

    ['Hello', ',', 'world', '!'] -> Option A
  4. Quick Check:

    spaCy separates punctuation as tokens [OK]
Hint: Remember spaCy splits punctuation into separate tokens [OK]
Common Mistakes:
  • Keeping punctuation attached to words
  • Combining words into one token
  • Ignoring punctuation tokens
4. Identify the error in this spaCy tokenization code:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Test sentence.')
for token in doc:
print(token.text)
medium
A. The token.text attribute does not exist.
B. Wrong model name used in spacy.load.
C. Missing indentation for print inside the for loop.
D. The variable 'doc' is not defined.

Solution

  1. Step 1: Check Python syntax for loops

    Python requires the code inside a for loop to be indented properly.
  2. Step 2: Inspect the given code

    The print call is not indented under the for loop, so Python raises an IndentationError before any of the code runs.
  3. Final Answer:

    Missing indentation for print inside the for loop. -> Option C
  4. Quick Check:

    Indent loop body code in Python [OK]
Hint: Indent code inside loops to avoid errors [OK]
Common Mistakes:
  • Ignoring Python indentation rules
  • Assuming model name is wrong
  • Thinking token.text is invalid
5. You want to tokenize a sentence but keep contractions like "don't" as one token using spaCy. Which approach is best?
hard
A. Use the default spaCy tokenizer without changes.
B. Split contractions manually after tokenization.
C. Replace contractions with full words before tokenization.
D. Modify the tokenizer exceptions to keep contractions as single tokens.

Solution

  1. Step 1: Understand spaCy's default tokenizer behavior

    By default, spaCy splits contractions like "don't" into two tokens: "do" and "n't".
  2. Step 2: Identify how to keep contractions as one token

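    Adding a tokenizer exception (special case) for a contraction, for example with nlp.tokenizer.add_special_case, overrides the default split so spaCy treats that contraction as a single token.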
  3. Final Answer:

    Modify the tokenizer exceptions to keep contractions as single tokens. -> Option D
  4. Quick Check:

    Customize tokenizer exceptions to control token splits [OK]
Hint: Change tokenizer exceptions to keep contractions whole [OK]
Common Mistakes:
  • Using default tokenizer expecting contractions as one token
  • Splitting contractions manually after tokenization
  • Replacing contractions before tokenization unnecessarily
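
The approach in Option D can be sketched with spaCy's `Tokenizer.add_special_case`. This is a minimal example under some assumptions: it uses `spacy.blank("en")` so that only the rule-based English tokenizer is needed (no trained model download), and the sample sentence is made up for illustration.

```python
import spacy
from spacy.symbols import ORTH

# Blank English pipeline: tokenizer rules only, no trained components.
nlp = spacy.blank("en")

text = "I don't know."
before = [token.text for token in nlp(text)]

# Register "don't" as a special case that consists of one token.
# The ORTH values must join back to exactly the original string.
nlp.tokenizer.add_special_case("don't", [{ORTH: "don't"}])
after = [token.text for token in nlp(text)]

print("default:    ", before)
print("customized: ", after)
```

Special cases match the exact string, so variants such as "Don't" or a contraction written with a curly apostrophe would each need their own entry.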