
BERT tokenization (WordPiece) in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main purpose of BERT tokenization using WordPiece?
To split words into smaller subword units so that rare or unknown words can be represented as combinations of known pieces, improving the model's understanding of language.
beginner
How does WordPiece handle unknown words during tokenization?
It splits the word greedily from left to right: at each position it takes the longest piece that is in the vocabulary, marking every non-initial piece with '##'. If no sequence of known pieces covers the word, the whole word is replaced by the [UNK] token.
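The lookup described in this answer can be sketched in a few lines. This is a minimal illustration, not BERT's actual code: the vocabulary `TOY_VOCAB` and the function name `wordpiece` are assumptions made up for the example, and real BERT vocabularies hold roughly 30,000 entries.

```python
# Toy vocabulary, assumed for illustration only.
TOY_VOCAB = {"un", "##aff", "##able", "##break", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("unaffable"))    # ['un', '##aff', '##able']
print(wordpiece("unbreakable"))  # ['un', '##break', '##able']
print(wordpiece("xyz"))          # ['[UNK]']
```

Note that the result depends entirely on the vocabulary: adding 'unaffable' to `TOY_VOCAB` would make the first call return a single token.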
beginner
Why does WordPiece add '##' before some tokens?
The '##' symbol marks that the token is a continuation of a previous token and not a standalone word, helping the model know how subwords connect to form full words.
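Because '##' records where a word continues, the original words can be rebuilt from the tokens. The helper below is a small sketch of that idea; the name `merge_wordpieces` is hypothetical and not part of any library.

```python
def merge_wordpieces(tokens):
    """Rebuild whole words by gluing each '##' piece onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the marker and extend the current word
        else:
            words.append(token)     # a token without '##' starts a new word
    return words


print(merge_wordpieces(["un", "##aff", "##able"]))            # ['unaffable']
print(merge_wordpieces(["play", "##ing", "foot", "##ball"]))  # ['playing', 'football']
```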
intermediate
Explain the difference between a word and a WordPiece token in BERT tokenization.
A word is a complete unit of language, while a WordPiece token can be a full word or a smaller part of a word. WordPiece tokens allow BERT to handle rare or new words by breaking them into known pieces.
intermediate
What is the advantage of using WordPiece tokenization over simple word-level tokenization?
WordPiece reduces the vocabulary size and handles rare or new words better by splitting them into subwords, which helps the model learn more efficiently and generalize to unseen words.
What does the '##' symbol indicate in WordPiece tokens?
A. The token is a suffix or continuation of a previous token
B. The token is a prefix of a word
C. The token is an unknown word
D. The token is a standalone word
Why does BERT use WordPiece tokenization instead of splitting only by spaces?
A. To increase vocabulary size
B. To handle rare and unknown words by breaking them into smaller parts
C. To remove punctuation
D. To translate words into another language
If the word 'unhappiness' is unknown, how might WordPiece tokenize it?
A. ['unhappiness']
B. ['un', '##happiness']
C. ['un', '##happy', '##ness']
D. ['unh', '##app', '##iness']
What is a key benefit of having a smaller vocabulary with WordPiece?
A. Less accurate predictions
B. More complex model architecture
C. More memory usage
D. Faster training and better handling of rare words
Which of these is NOT true about WordPiece tokenization?
A. It always treats each word as a single token
B. It uses '##' to mark subword continuations
C. It splits words into subwords
D. It helps handle unknown words
Describe how BERT's WordPiece tokenization works and why it is useful.
Think about how breaking words into smaller parts helps the model.
Explain the role of the '##' symbol in WordPiece tokens and give an example.
Consider how subwords connect to form full words.

      Practice

      1. What is the main purpose of BERT's WordPiece tokenization?
      easy
      A. To split words into smaller known pieces for better handling of unknown words
      B. To translate text into another language
      C. To remove stop words from sentences
      D. To convert text into numerical vectors directly

      Solution

      1. Step 1: Understand WordPiece tokenization

        WordPiece splits words that are not in its vocabulary into smaller subword tokens that are; words already in the vocabulary stay whole.
      2. Step 2: Identify the purpose of this splitting

        This splitting helps the model recognize parts of words it has seen before, improving understanding.
      3. Final Answer:

        To split words into smaller known pieces for better handling of unknown words -> Option A
      4. Quick Check:

        WordPiece = splitting unknown words [OK]
      Hint: WordPiece breaks unknown words into known parts [OK]
      Common Mistakes:
      • Thinking WordPiece translates text
      • Confusing tokenization with stop word removal
      • Assuming WordPiece directly converts text to numbers
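The last mistake is worth a concrete look: tokenization produces string pieces, and a separate vocabulary lookup turns those pieces into integer IDs. The sketch below assumes a tiny made-up vocabulary where each token's ID is its position in the list; `vocab_lines` and `tokens_to_ids` are hypothetical names for this example.

```python
# Hypothetical vocabulary file contents: one token per line, ID = line index.
vocab_lines = ["[PAD]", "[UNK]", "un", "##aff", "##able"]
token_to_id = {token: index for index, token in enumerate(vocab_lines)}


def tokens_to_ids(tokens, unk="[UNK]"):
    """Look up each token's integer ID; tokens outside the vocabulary get the unknown ID."""
    return [token_to_id.get(token, token_to_id[unk]) for token in tokens]


print(tokens_to_ids(["un", "##aff", "##able"]))  # [2, 3, 4]
print(tokens_to_ids(["un", "##known"]))          # [2, 1]
```

The model only ever sees the IDs; the embedding layer then maps each ID to a vector.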
      2. Which of the following is the correct way to represent the word 'unaffable' using WordPiece tokens?
      easy
      A. ["un", "##affable"]
      B. ["unaffable"]
      C. ["un", "aff", "able"]
      D. ["un", "##aff", "##able"]

      Solution

      1. Step 1: Understand WordPiece token format

        WordPiece uses '##' to mark tokens that continue from a previous token.
      2. Step 2: Analyze the options

        ["un", "##aff", "##able"] correctly splits 'unaffable' into 'un' + '##aff' + '##able', showing continuation tokens.
      3. Final Answer:

        ["un", "##aff", "##able"] -> Option D
      4. Quick Check:

        Continuation tokens start with ## [OK]
      Hint: Look for '##' prefix on continuation tokens [OK]
      Common Mistakes:
      • Ignoring '##' prefix for continuation tokens
      • Treating whole word as one token always
      • Splitting tokens without '##' where needed
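      3. Given the sentence "Playing football is fun" and a vocabulary that contains 'Play', '##ing', 'foot', '##ball', 'is', and 'fun' but not 'Playing' or 'football', which is the correct WordPiece tokenization output?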
      medium
      A. ["Play", "##ing", "football", "is", "fun"]
      B. ["Playing", "football", "is", "fun"]
      C. ["Play", "##ing", "foot", "##ball", "is", "fun"]
      D. ["Play", "ing", "foot", "##ball", "is", "fun"]

      Solution

      1. Step 1: Tokenize 'Playing'

        'Playing' is not in the vocabulary, so WordPiece takes the longest prefix that is, 'Play', and covers the rest with '##ing'.
      2. Step 2: Tokenize 'football'

        'football' is not in the vocabulary either, so it is split the same way into 'foot' and '##ball'.
      3. Step 3: Check remaining words

        'is' and 'fun' are in the vocabulary and remain single tokens.
      4. Final Answer:

        ["Play", "##ing", "foot", "##ball", "is", "fun"] -> Option C
      5. Quick Check:

        Longest known piece first + ## continuation tokens [OK]
      Hint: Take the longest known piece, add ## for continuations [OK]
      Common Mistakes:
      • Not splitting words that are missing from the vocabulary
      • Missing ## prefix on continuation tokens
      • Treating all words as single tokens
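The answer to this question depends on which pieces the vocabulary contains. The sketch below tokenizes the same sentence against two made-up vocabularies to show the difference. It is a simplified illustration: `small_vocab`, `large_vocab`, and both function names are assumptions for this example, and the lowercasing and punctuation splitting that a real BERT tokenizer performs first are left out.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word; ['[UNK]'] if no split exists."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched at this position
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(sentence, vocab):
    """Split on whitespace, then apply WordPiece to each word in turn."""
    tokens = []
    for word in sentence.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


small_vocab = {"Play", "##ing", "foot", "##ball", "is", "fun"}
large_vocab = small_vocab | {"Playing", "football"}

print(tokenize("Playing football is fun", small_vocab))
# ['Play', '##ing', 'foot', '##ball', 'is', 'fun']
print(tokenize("Playing football is fun", large_vocab))
# ['Playing', 'football', 'is', 'fun']
```

With the larger vocabulary the whole words win, because the longest match is always tried first.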
      4. Identify the error in this WordPiece tokenization output for the word 'unhappy': ["un", "happy"]
      medium
      A. Missing '##' prefix on 'happy' token
      B. Incorrect splitting; 'unhappy' should be one token
      C. Tokens should be reversed order
      D. No error; this is correct tokenization

      Solution

      1. Step 1: Check token continuation rules

        In WordPiece, every piece of a word after the first must start with '##' to show that it continues the same word.
      2. Step 2: Analyze given tokens

        'happy' is a continuation of 'un', so it should be '##happy', not 'happy'.
      3. Final Answer:

        Missing '##' prefix on 'happy' token -> Option A
      4. Quick Check:

        Continuation tokens need '##' prefix [OK]
      Hint: Check if continuation tokens start with '##' [OK]
      Common Mistakes:
      • Forgetting '##' on continuation tokens
      • Assuming all tokens are standalone
      • Thinking the tokens should be reordered
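The rule behind this question can be checked mechanically. The function below is a sketch of such a check for the pieces of a single word; `is_valid_split` is a hypothetical helper, not a library function.

```python
def is_valid_split(word, tokens):
    """Check that tokens follow the '##' convention and spell out the original word."""
    if not tokens:
        return False
    if tokens[0].startswith("##"):
        return False  # the first piece of a word never carries the marker
    if not all(token.startswith("##") for token in tokens[1:]):
        return False  # every later piece must carry it
    rebuilt = tokens[0] + "".join(token[2:] for token in tokens[1:])
    return rebuilt == word


print(is_valid_split("unhappy", ["un", "happy"]))    # False
print(is_valid_split("unhappy", ["un", "##happy"]))  # True
print(is_valid_split("unhappy", ["unhappy"]))        # True
```

The check only looks at format. Whether a split is the one a tokenizer would actually produce still depends on its vocabulary.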
      5. You want to tokenize the sentence "The unbreakable bond" using BERT's WordPiece tokenizer. Which tokenization output correctly handles the unknown word 'unbreakable'?
      hard
      A. ["The", "unbreakable", "bond"]
      B. ["The", "un", "##break", "##able", "bond"]
      C. ["The", "un", "breakable", "bond"]
      D. ["The", "un", "##breakable", "bond"]

      Solution

      1. Step 1: Understand unknown word handling

        WordPiece breaks unknown words into smaller known subwords with '##' for continuation.
      2. Step 2: Analyze 'unbreakable'

        It splits into 'un' + '##break' + '##able' to represent parts seen in vocabulary.
      3. Step 3: Check other tokens

        'The' and 'bond' are in the vocabulary and remain single tokens.
      4. Final Answer:

        ["The", "un", "##break", "##able", "bond"] -> Option B
      5. Quick Check:

        Unknown words split into known subwords with ## [OK]
      Hint: Split unknown words into known parts with ## prefix [OK]
      Common Mistakes:
      • Treating unknown words as single tokens
      • Missing ## on continuation tokens
      • Leaving a long piece such as '##breakable' unsplit when it is not in the vocabulary