BERT tokenization (WordPiece) in NLP - Model Metrics & Evaluation

Metrics & Evaluation - BERT tokenization (WordPiece)
Which metric matters for BERT tokenization (WordPiece) and WHY

BERT's WordPiece tokenizer splits each word into subword pieces drawn from a fixed vocabulary. The key metric to check is tokenization coverage: the share of tokens that map to known vocabulary pieces rather than to the unknown token ([UNK]). Good coverage means fewer unknown tokens, so less of the text is lost before it reaches the model.
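As a rough illustration, coverage can be measured as the share of emitted tokens that are not [UNK]. The sketch below uses a made-up toy vocabulary and a simplified greedy longest-match-first splitter. The names `TOY_VOCAB`, `wordpiece`, and `coverage` are assumptions for this example, and a real BERT tokenizer also normalizes text and splits punctuation before this step.

```python
# Toy vocabulary for illustration only; real BERT vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"the", "un", "##happi", "##ness", "play", "##ing"}
UNK = "[UNK]"

def wordpiece(word, vocab):
    """Split one word greedily, always taking the longest matching piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces after the first carry the '##' continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: the whole word is unknown
            return [UNK]
        pieces.append(piece)
        start = end
    return pieces

def coverage(words, vocab):
    """Fraction of emitted tokens that are known (not [UNK])."""
    tokens = [p for w in words for p in wordpiece(w, vocab)]
    return 1 - tokens.count(UNK) / len(tokens)

words = ["the", "unhappiness", "playing", "xyz"]
print(f"coverage: {coverage(words, TOY_VOCAB):.1%}")  # 6 of 7 tokens are known
```

Counting at the token level matches the "known tokens" wording used in this lesson; counting the share of words that avoid [UNK] is an equally reasonable definition.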

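Another important metric is tokenization consistency: the same word should be split the same way every time. WordPiece itself is deterministic for a fixed vocabulary, so inconsistent splits usually point to mismatched preprocessing (for example, casing or Unicode normalization) or to a vocabulary that differs from the one the model was trained with. Consistent splits help the model learn stable word meanings.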

Confusion matrix or equivalent visualization

Instead of a confusion matrix, we use a tokenization example comparison to see how words are split:

Original text: "unhappiness"
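WordPiece tokens: ["un", "##happi", "##ness"] (illustrative; the exact split depends on the vocabulary, but the pieces must join back into the original word)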

Unknown tokens: 0

Coverage: 100% known tokens

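WordPiece never emits a partly unknown word. A misspelling such as "unhappyness" usually still splits into known pieces, just more of them, e.g. ["un", "##happ", "##yn", "##ess"]. If any part of a word cannot be matched against the vocabulary, the whole word becomes a single [UNK] token.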
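The comparison above can be reproduced with a small sketch of greedy longest-match-first splitting. The vocabulary here is hypothetical and chosen only so the example words behave as described; `wordpiece` is an illustrative helper, not a library function.

```python
# Hypothetical vocabulary, picked so the example words split as in the text.
VOCAB = {"un", "##happi", "##ness", "##happ", "##yn", "##ess"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # One unmatched remainder turns the whole word into [UNK].
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

for word in ["unhappiness", "unhappyness", "unhappinezz"]:
    pieces = wordpiece(word, VOCAB)
    unknown = pieces.count("[UNK]")
    print(f"{word!r}: {pieces} (unknown tokens: {unknown})")
```

With this vocabulary the misspelling "unhappyness" still splits into known pieces, while "unhappinezz" collapses to a single [UNK] because no piece covers "nezz".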
    
Precision vs Recall tradeoff (or equivalent) with concrete examples

For tokenization, the tradeoff is between vocabulary size and token granularity:

  • A large vocabulary means fewer splits, so more words stay whole and sequences are shorter. But it needs more memory (a larger embedding table), and rare words can still be missing from it.
  • A small vocabulary means more splits into subwords, so rare words are more likely to be covered by known pieces. But each token carries less meaning and sequences get longer.

Example: "playing" can be one token or split into "play" + "##ing". A smaller vocabulary can still handle unseen forms like "playings" by splitting them into known pieces.
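To make the tradeoff concrete, the sketch below tokenizes the same words against two hypothetical vocabularies that differ only in whether "playing" is stored as a whole word. `SMALL`, `LARGE`, and `wordpiece` are names invented for this example.

```python
SMALL = {"play", "##ing", "##s"}   # subword pieces only
LARGE = SMALL | {"playing"}        # also stores the whole word

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

for word in ["playing", "playings"]:
    small, large = wordpiece(word, SMALL), wordpiece(word, LARGE)
    print(f"{word}: small vocab -> {small}, large vocab -> {large}")
```

The larger vocabulary yields one token fewer per word here, at the cost of one more vocabulary entry; both vocabularies still cover the unseen form "playings".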

What "good" vs "bad" metric values look like for BERT tokenization

Good tokenization:

  • High coverage: Most words split into known tokens (e.g., > 95% coverage)
  • Consistent splits: Same words always tokenized the same way
  • Balanced vocabulary size: Not too big or too small

Bad tokenization:

  • Many unknown tokens, hurting model understanding
  • Inconsistent token splits causing confusion
  • A vocabulary that is too large (slow training, more memory) or too small (long token sequences)
Metrics pitfalls
  • Ignoring unknown tokens: Overlooking unknown tokens can hide poor coverage.
  • Overfitting vocabulary: Making vocabulary too specific to training data hurts generalization.
  • Long token sequences: Too many splits increase sequence length, slowing training and inference.
  • Inconsistent tokenization: Different splits for same words confuse the model.
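
Two of the pitfalls above, unknown tokens and long sequences, can be tracked from a token list alone. The following is a minimal sketch; `token_stats` is a hypothetical helper, and it assumes special tokens such as [CLS] and [SEP] have already been removed and that the list starts with a word-initial token.

```python
def token_stats(tokens, unk="[UNK]"):
    """Summarize a WordPiece token list: length, word count, splits, unknowns."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    # A new word starts at every token that is not a '##' continuation.
    words = sum(1 for t in tokens if not t.startswith("##"))
    return {
        "tokens": len(tokens),
        "words": words,
        "pieces_per_word": len(tokens) / words,
        "unk_rate": tokens.count(unk) / len(tokens),
    }

stats = token_stats(["the", "un", "##break", "##able", "bond", "[UNK]"])
for name, value in stats.items():
    print(f"{name}: {value}")
```

A rising `pieces_per_word` signals longer sequences, and a rising `unk_rate` signals falling coverage.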
Self-check question

Your tokenizer has 98% coverage but splits common words inconsistently. Is it good?

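Answer: No. High coverage is good, but inconsistent splits confuse the model. Because WordPiece is deterministic for a fixed vocabulary, inconsistent splits are a sign to check preprocessing such as casing and normalization. Both coverage and consistency matter for good tokenization.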

Key Result
High token coverage and consistent token splits are key to effective BERT WordPiece tokenization.

Practice

1. What is the main purpose of BERT's WordPiece tokenization?
easy
A. To split words into smaller known pieces for better handling of unknown words
B. To translate text into another language
C. To remove stop words from sentences
D. To convert text into numerical vectors directly

Solution

  1. Step 1: Understand WordPiece tokenization

    WordPiece breaks words into smaller parts called tokens, especially for unknown or rare words.
  2. Step 2: Identify the purpose of this splitting

    This splitting helps the model recognize parts of words it has seen before, improving understanding.
  3. Final Answer:

    To split words into smaller known pieces for better handling of unknown words -> Option A
  4. Quick Check:

    WordPiece = splitting unknown words [OK]
Hint: WordPiece breaks unknown words into known parts [OK]
Common Mistakes:
  • Thinking WordPiece translates text
  • Confusing tokenization with stop word removal
  • Assuming WordPiece directly converts text to numbers
2. Which of the following is the correct way to represent the word 'unaffable' using WordPiece tokens?
easy
A. ["un", "##affable"]
B. ["unaffable"]
C. ["un", "aff", "able"]
D. ["un", "##aff", "##able"]

Solution

  1. Step 1: Understand WordPiece token format

    WordPiece uses '##' to mark tokens that continue from a previous token.
  2. Step 2: Analyze the options

    ["un", "##aff", "##able"] correctly splits 'unaffable' into 'un' + '##aff' + '##able', showing continuation tokens.
  3. Final Answer:

    ["un", "##aff", "##able"] -> Option D
  4. Quick Check:

    Continuation tokens start with ## [OK]
Hint: Look for '##' prefix on continuation tokens [OK]
Common Mistakes:
  • Ignoring '##' prefix for continuation tokens
  • Treating whole word as one token always
  • Splitting tokens without '##' where needed
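
The '##' convention is what lets tokens be joined back into words. A minimal sketch, using a hypothetical helper named `merge_pieces`:

```python
def merge_pieces(tokens):
    """Rebuild words by gluing each '##' piece onto the token before it."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the prefix and append to the current word
        else:
            words.append(token)     # a token without '##' starts a new word
    return words

print(merge_pieces(["un", "##aff", "##able"]))  # one word
print(merge_pieces(["un", "aff", "able"]))      # three separate words
```

Without the prefix, the pieces read as three separate words, which is why the unprefixed split does not represent 'unaffable'.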
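3. Given the sentence "Playing football is fun", which is the correct WordPiece tokenization output? Assume the vocabulary contains "Play", "##ing", "foot", and "##ball" but not "Playing" or "football".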
medium
A. ["Play", "##ing", "football", "is", "fun"]
B. ["Playing", "football", "is", "fun"]
C. ["Play", "##ing", "foot", "##ball", "is", "fun"]
D. ["Play", "ing", "foot", "##ball", "is", "fun"]

Solution

  1. Step 1: Tokenize 'Playing'

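    WordPiece splits 'Playing' into 'Play' and '##ing' because 'Play' is the longest vocabulary entry that matches the start of the word and '##ing' covers the rest.
  2. Step 2: Tokenize 'football'

    'football' is split the same way into 'foot' and '##ball', both of which are vocabulary pieces.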
  3. Step 3: Check remaining words

    'is' and 'fun' are common words and remain as single tokens.
  4. Final Answer:

    ["Play", "##ing", "foot", "##ball", "is", "fun"] -> Option C
  5. Quick Check:

    Known roots + ## continuation tokens [OK]
Hint: Split known roots, add ## for continuations [OK]
Common Mistakes:
  • Not splitting compound words like football
  • Missing ## prefix on continuation tokens
  • Treating all words as single tokens
4. Identify the error in this WordPiece tokenization output for the word 'unhappy': ["un", "happy"]
medium
A. Missing '##' prefix on 'happy' token
B. Incorrect splitting; 'unhappy' should be one token
C. Tokens should be reversed order
D. No error; this is correct tokenization

Solution

  1. Step 1: Check token continuation rules

    In WordPiece, every piece of a word after the first must start with '##' to mark it as a continuation.
  2. Step 2: Analyze given tokens

    'happy' is a continuation of 'un', so it should be '##happy', not 'happy'.
  3. Final Answer:

    Missing '##' prefix on 'happy' token -> Option A
  4. Quick Check:

    Continuation tokens need '##' prefix [OK]
Hint: Check if continuation tokens start with '##' [OK]
Common Mistakes:
  • Forgetting '##' on continuation tokens
  • Assuming all tokens are standalone
  • Thinking order of tokens matters here
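
The rule checked in this question can be written as a small validator. This is a sketch with a hypothetical function name, `is_valid_split`; it checks only the prefix rule and that the pieces join back into the word, not whether each piece exists in a vocabulary.

```python
def is_valid_split(word, pieces):
    """Check the '##' prefix rule and that the pieces rebuild the word."""
    if not pieces or pieces[0].startswith("##"):
        return False  # the first piece must not be a continuation
    if any(not p.startswith("##") for p in pieces[1:]):
        return False  # every later piece must be a continuation
    rebuilt = pieces[0] + "".join(p[2:] for p in pieces[1:])
    return rebuilt == word

print(is_valid_split("unhappy", ["un", "happy"]))    # missing '##' on 'happy'
print(is_valid_split("unhappy", ["un", "##happy"]))  # well-formed split
```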
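5. You want to tokenize the sentence "The unbreakable bond" using BERT's WordPiece tokenizer. Which tokenization output correctly handles the out-of-vocabulary word 'unbreakable'? Assume the vocabulary contains "un", "##break", and "##able" but not "unbreakable" or "##breakable".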
hard
A. ["The", "unbreakable", "bond"]
B. ["The", "un", "##break", "##able", "bond"]
C. ["The", "un", "breakable", "bond"]
D. ["The", "un", "##breakable", "bond"]

Solution

  1. Step 1: Understand unknown word handling

    WordPiece breaks unknown words into smaller known subwords with '##' for continuation.
  2. Step 2: Analyze 'unbreakable'

    It splits into 'un' + '##break' + '##able' to represent parts seen in vocabulary.
  3. Step 3: Check other tokens

    'The' and 'bond' are common words and remain as single tokens.
  4. Final Answer:

    ["The", "un", "##break", "##able", "bond"] -> Option B
  5. Quick Check:

    Unknown words split into known subwords with ## [OK]
Hint: Split unknown words into known parts with ## prefix [OK]
Common Mistakes:
  • Treating unknown words as single tokens
  • Missing ## on continuation tokens
  • Splitting without ## prefix on continuation