
BERT tokenization (WordPiece) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
How does WordPiece handle unknown words?

When BERT's WordPiece tokenizer encounters a word not in its vocabulary, what does it do?

A. It replaces the entire word with the special token [PAD].
B. It breaks the word into smaller known subword units; if the word cannot be fully covered by known pieces, it falls back to [UNK].
C. It ignores the word and removes it from the input sequence.
D. It treats the whole word as a single token even if unknown.
💡 Hint

Think about how WordPiece tries to represent words using pieces it knows.

Predict Output
intermediate
Output tokens from WordPiece tokenizer

Given the input word "unaffable", what is the output token list from BERT's WordPiece tokenizer?

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize('unaffable')
print(tokens)
A. ["un", "aff", "able"]
B. ["unaffable"]
C. ["una", "##ffa", "##ble"]
D. ["un", "##aff", "##able"]
💡 Hint

Look for subwords starting with ## indicating continuation pieces.

Model Choice
advanced
Choosing tokenizer for domain-specific text

You want to fine-tune BERT on medical text with many rare terms. Which tokenizer approach is best?

A. Use a whitespace tokenizer that splits only on spaces.
B. Use a character-level tokenizer that splits every character.
C. Train a new WordPiece tokenizer on the medical corpus to capture rare terms better.
D. Use the original BERT WordPiece tokenizer without changes.
💡 Hint

Think about how to handle many rare or new words effectively.

Metrics
advanced
Measuring tokenization efficiency

You compare two tokenizers on the same text. Tokenizer A produces 100 tokens; Tokenizer B produces 130 tokens. Which statement is true about tokenization efficiency?

A. Tokenizer A is more efficient because it uses fewer tokens to represent the text.
B. Tokenizer B is more efficient because it produces more tokens for detail.
C. Both are equally efficient because token count does not matter.
D. Tokenizer B is more efficient because more tokens mean better accuracy.
💡 Hint

Fewer tokens usually mean less computation and simpler input.
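
One simple way to put a number on this is the average number of tokens per whitespace-separated word. The sketch below is only an illustration: the helper name tokens_per_word and both token lists are made up for this example and do not come from a real tokenizer.

```python
def tokens_per_word(text, tokens):
    """Average number of tokens used per whitespace-separated word."""
    words = text.split()
    return len(tokens) / len(words)


text = "the unbreakable bond"

# Two hypothetical tokenizations of the same text (illustrative only).
tokens_a = ["the", "unbreakable", "bond"]
tokens_b = ["the", "un", "##break", "##able", "bond"]

print(f"A: {tokens_per_word(text, tokens_a):.2f}")  # A: 1.00
print(f"B: {tokens_per_word(text, tokens_b):.2f}")  # B: 1.67
```

A lower value means shorter input sequences for the same text, which is what matters for compute and for fitting within a model's maximum sequence length.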

🔧 Debug
expert
Why does this WordPiece tokenization code raise an error?

Consider this code snippet:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(12345)
print(tokens)

What error does this code raise?

A. TypeError because the input to tokenize must be a string, not an integer.
B. ValueError because the number 12345 is out of vocabulary range.
C. KeyError because 12345 is not a token in the vocabulary.
D. No error; it tokenizes the number as a string.
💡 Hint

Check the input type expected by the tokenizer.

Practice

1. What is the main purpose of BERT's WordPiece tokenization?
easy
A. To split words into smaller known pieces for better handling of unknown words
B. To translate text into another language
C. To remove stop words from sentences
D. To convert text into numerical vectors directly

Solution

  1. Step 1: Understand WordPiece tokenization

    WordPiece keeps words found in its vocabulary whole and breaks rare or unknown words into smaller subword tokens.
  2. Step 2: Identify the purpose of this splitting

    Because the pieces come from a fixed vocabulary the model was trained on, the model can represent a word it has never seen as a sequence of parts it has seen.
  3. Final Answer:

    To split words into smaller known pieces for better handling of unknown words -> Option A
  4. Quick Check:

    WordPiece = splitting unknown words [OK]
Hint: WordPiece breaks unknown words into known parts [OK]
Common Mistakes:
  • Thinking WordPiece translates text
  • Confusing tokenization with stop word removal
  • Assuming WordPiece directly converts text to numbers
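
The splitting described in this solution can be sketched as a greedy longest-match-first search over a vocabulary. This is a minimal illustration, not the library's code: the function name wordpiece and the tiny TOY_VOCAB are assumptions made up for the example, and a real BERT vocabulary has tens of thousands of entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # some part cannot be matched: whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"un", "##aff", "##able", "##happy"}

print(wordpiece("unaffable", TOY_VOCAB))  # ['un', '##aff', '##able']
print(wordpiece("unhappy", TOY_VOCAB))    # ['un', '##happy']
print(wordpiece("xyz", TOY_VOCAB))        # ['[UNK]']
```

Note that the output is still a list of strings; mapping pieces to integer IDs is a separate lookup step.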
2. Which of the following is the correct way to represent the word 'unaffable' using WordPiece tokens?
easy
A. ["un", "##affable"]
B. ["unaffable"]
C. ["un", "aff", "able"]
D. ["un", "##aff", "##able"]

Solution

  1. Step 1: Understand WordPiece token format

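    WordPiece uses '##' to mark a piece that continues the word started by the previous token.
  2. Step 2: Analyze the options

    ["un", "##aff", "##able"] splits 'unaffable' into 'un' + '##aff' + '##able', with '##' on every continuation piece. This assumes those three pieces are in the vocabulary; the split a real tokenizer produces depends on its vocabulary.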
  3. Final Answer:

    ["un", "##aff", "##able"] -> Option D
  4. Quick Check:

    Continuation tokens start with ## [OK]
Hint: Look for '##' prefix on continuation tokens [OK]
Common Mistakes:
  • Ignoring '##' prefix for continuation tokens
  • Treating whole word as one token always
  • Splitting tokens without '##' where needed
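
To see why the '##' prefix matters, it can help to go in the other direction and glue pieces back into words. The helper below, merge_wordpieces, is a hypothetical name used for this sketch; it attaches every '##' piece to the token before it.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into whole words using the '##' marker."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append to previous word
        else:
            words.append(token)     # no prefix: this token starts a new word
    return words


print(merge_wordpieces(["un", "##aff", "##able"]))  # ['unaffable']
print(merge_wordpieces(["un", "aff", "able"]))      # ['un', 'aff', 'able']
```

Without the prefix, the second list reads as three separate words, so the original word can no longer be recovered.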
3. Given the sentence "Playing football is fun", which is the correct WordPiece tokenization output?
medium
A. ["Play", "##ing", "football", "is", "fun"]
B. ["Playing", "football", "is", "fun"]
C. ["Play", "##ing", "foot", "##ball", "is", "fun"]
D. ["Play", "ing", "foot", "##ball", "is", "fun"]

Solution

  1. Step 1: Tokenize 'Playing'

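    Assuming 'Playing' is not in the vocabulary as a whole word, WordPiece greedily matches the longest known prefix, 'Play', and emits the remainder as the continuation piece '##ing'.
  2. Step 2: Tokenize 'football'

    Under the same assumption, 'football' splits into 'foot' and '##ball'. A vocabulary that already contains 'Playing' or 'football' would keep those words whole, so the exact split depends on the vocabulary.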
  3. Step 3: Check remaining words

    'is' and 'fun' are common words and remain as single tokens.
  4. Final Answer:

    ["Play", "##ing", "foot", "##ball", "is", "fun"] -> Option C
  5. Quick Check:

    Known roots + ## continuation tokens [OK]
Hint: Split known roots, add ## for continuations [OK]
Common Mistakes:
  • Not splitting compound words like football
  • Missing ## prefix on continuation tokens
  • Treating all words as single tokens
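
The split in this solution holds for a vocabulary that has the pieces but not the whole words. The sketch below makes that dependence visible with two made-up vocabularies; wordpiece, tokenize, small_vocab, and large_vocab are all hypothetical names for this illustration, and the sentence is split on whitespace only.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = ("##" if start else "") + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]  # no known piece fits at this position
    return pieces


def tokenize(sentence, vocab):
    """Split on whitespace, then apply WordPiece to each word."""
    return [p for w in sentence.split() for p in wordpiece(w, vocab)]


# Hypothetical vocabularies, for illustration only.
small_vocab = {"Play", "##ing", "foot", "##ball", "is", "fun"}
large_vocab = small_vocab | {"Playing", "football"}

sentence = "Playing football is fun"
print(tokenize(sentence, small_vocab))
# ['Play', '##ing', 'foot', '##ball', 'is', 'fun']
print(tokenize(sentence, large_vocab))
# ['Playing', 'football', 'is', 'fun']
```

Because the longest match is tried first, a word that is in the vocabulary is never split.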
4. Identify the error in this WordPiece tokenization output for the word 'unhappy': ["un", "happy"]
medium
A. Missing '##' prefix on 'happy' token
B. Incorrect splitting; 'unhappy' should be one token
C. Tokens should be reversed order
D. No error; this is correct tokenization

Solution

  1. Step 1: Check token continuation rules

    In WordPiece, every piece of a word after the first must start with '##' to show that it continues the same word.
  2. Step 2: Analyze given tokens

    'happy' is a continuation of 'un', so it should be '##happy', not 'happy'.
  3. Final Answer:

    Missing '##' prefix on 'happy' token -> Option A
  4. Quick Check:

    Continuation tokens need '##' prefix [OK]
Hint: Check if continuation tokens start with '##' [OK]
Common Mistakes:
  • Forgetting '##' on continuation tokens
  • Assuming all tokens are standalone
  • Thinking order of tokens matters here
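
The check applied in this solution can be written as a small validator. This is a sketch under the assumption that the pieces are meant to cover exactly one word; is_valid_split is a hypothetical helper name, and it does not check whether the pieces exist in any vocabulary.

```python
def is_valid_split(word, pieces):
    """Return True if pieces are a well-formed WordPiece split of one word."""
    if not pieces or pieces[0].startswith("##"):
        return False  # the first piece must not carry the prefix
    if any(not p.startswith("##") for p in pieces[1:]):
        return False  # every later piece must be marked as a continuation
    rebuilt = pieces[0] + "".join(p[2:] for p in pieces[1:])
    return rebuilt == word


print(is_valid_split("unhappy", ["un", "happy"]))    # False
print(is_valid_split("unhappy", ["un", "##happy"]))  # True
```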
5. You want to tokenize the sentence "The unbreakable bond" using BERT's WordPiece tokenizer. Which tokenization output correctly handles the unknown word 'unbreakable'?
hard
A. ["The", "unbreakable", "bond"]
B. ["The", "un", "##break", "##able", "bond"]
C. ["The", "un", "breakable", "bond"]
D. ["The", "un", "##breakable", "bond"]

Solution

  1. Step 1: Understand unknown word handling

    WordPiece breaks an out-of-vocabulary word into smaller known subwords and marks every piece after the first with '##'.
  2. Step 2: Analyze 'unbreakable'

    Assuming the whole word is not in the vocabulary but 'un', '##break', and '##able' are, it splits into 'un' + '##break' + '##able'.
  3. Step 3: Check other tokens

    'The' and 'bond' are common words and remain as single tokens.
  4. Final Answer:

    ["The", "un", "##break", "##able", "bond"] -> Option B
  5. Quick Check:

    Unknown words split into known subwords with ## [OK]
Hint: Split unknown words into known parts with ## prefix [OK]
Common Mistakes:
  • Treating unknown words as single tokens
  • Missing ## on continuation tokens
  • Splitting without ## prefix on continuation