What is BERT tokenization (WordPiece) in NLP?

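BERT tokenization breaks text into smaller pieces called tokens using the WordPiece algorithm. A word that is in the vocabulary stays a single token; a rare or unseen word is split into subword pieces that are in the vocabulary, so the model can still represent it from parts it already knows.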

Practice

1. What is the main purpose of BERT's WordPiece tokenization?

easy

A. To split words into smaller known pieces for better handling of unknown words

B. To translate text into another language

C. To remove stop words from sentences

D. To convert text into numerical vectors directly

Solution

Step 1: Understand WordPiece tokenization
WordPiece keeps a word whole when it is in the vocabulary and breaks unknown or rare words into smaller subword tokens that are in the vocabulary.
Step 2: Identify the purpose of this splitting
This splitting helps the model recognize parts of words it has seen before, improving understanding.
Final Answer:
To split words into smaller known pieces for better handling of unknown words -> Option A
Quick Check:
WordPiece = splitting unknown words [OK]

Hint: WordPiece breaks unknown words into known parts [OK]

Common Mistakes:

Thinking WordPiece translates text
Confusing tokenization with stop word removal
Assuming WordPiece directly converts text to numbers
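
To make the last mistake concrete, here is a minimal sketch of the two separate stages: WordPiece outputs token strings, and a vocabulary lookup then maps each string to an integer ID. The vocabulary and its ID values are made up for illustration and do not come from a real BERT checkpoint.

```python
# Toy vocabulary: token string -> integer ID (hypothetical values).
vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}

# Stage 1 output: WordPiece produces strings, not numbers.
pieces = ["un", "##aff", "##able"]

# Stage 2: a separate lookup turns each piece into an ID.
# Pieces missing from the vocabulary fall back to the [UNK] ID.
ids = [vocab.get(piece, vocab["[UNK]"]) for piece in pieces]

print(pieces)  # ['un', '##aff', '##able']
print(ids)     # [1, 2, 3]
```

The embedding vectors the model actually consumes are looked up from these IDs in a later step, inside the model itself.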

2. Which of the following is the correct way to represent the word 'unaffable' using WordPiece tokens?

easy

A. ["un", "##affable"]

B. ["unaffable"]

C. ["un", "aff", "able"]

D. ["un", "##aff", "##able"]

Solution

Step 1: Understand WordPiece token format
WordPiece uses '##' to mark a token that continues the word started by the previous token; the first piece of a word has no prefix.
Step 2: Analyze the options
["un", "##aff", "##able"] correctly splits 'unaffable' into 'un' + '##aff' + '##able', showing continuation tokens.
Final Answer:
["un", "##aff", "##able"] -> Option D
Quick Check:
Continuation tokens start with ## [OK]

Hint: Look for '##' prefix on continuation tokens [OK]

Common Mistakes:

Ignoring '##' prefix for continuation tokens
Treating whole word as one token always
Splitting tokens without '##' where needed
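
The split in Question 2 can be reproduced with a greedy longest-match-first loop, which is the usual way WordPiece is applied at inference time. The sketch below is a simplified illustration, not the library implementation: the function name `wordpiece` and the tiny `toy_vocab` are assumptions chosen so that 'unaffable' splits as in Option D.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # piece is not at the word start
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary for illustration only.
toy_vocab = {"un", "##aff", "##able", "##a"}

print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

Note that '##aff' wins over the shorter '##a' because the longest matching piece is always tried first.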

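3. Given the sentence "Playing football is fun", which is the correct WordPiece tokenization output, assuming the vocabulary contains 'Play', '##ing', 'foot' and '##ball' but not the full words 'Playing' or 'football'?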

medium

A. ["Play", "##ing", "football", "is", "fun"]

B. ["Playing", "football", "is", "fun"]

C. ["Play", "##ing", "foot", "##ball", "is", "fun"]

D. ["Play", "ing", "foot", "##ball", "is", "fun"]

Solution

Step 1: Tokenize 'Playing'
WordPiece splits 'Playing' into 'Play' and '##ing' when the full word is not in the vocabulary and 'Play' is the longest prefix that is.
Step 2: Tokenize 'football'
Likewise, 'football' splits into 'foot' and '##ball' when the full word is not in the vocabulary but both pieces are.
Step 3: Check remaining words
'is' and 'fun' are in the vocabulary as whole words, so they remain single tokens.
Final Answer:
["Play", "##ing", "foot", "##ball", "is", "fun"] -> Option C
Quick Check:
Known roots + ## continuation tokens [OK]

Hint: Split known roots, add ## for continuations [OK]

Common Mistakes:

Not splitting compound words like football
Missing ## prefix on continuation tokens
Treating all words as single tokens
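
The answer to Question 3 depends on which entries the vocabulary contains, which a short experiment can show. The sketch below assumes whitespace pre-splitting and two made-up vocabularies; `tokenize`, `small_vocab` and `large_vocab` are hypothetical names, and a real BERT vocabulary may well contain 'football' as a whole word.

```python
def tokenize(sentence, vocab):
    """Whitespace split, then greedy longest-match WordPiece per word."""
    tokens = []
    for word in sentence.split():
        pieces, start = [], 0
        while start < len(word):
            # Scan from the longest candidate down to a single character.
            for end in range(len(word), start, -1):
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    pieces.append(piece)
                    start = end
                    break
            else:
                # No candidate matched: mark the whole word as unknown.
                pieces = ["[UNK]"]
                break
        tokens.extend(pieces)
    return tokens


small_vocab = {"Play", "##ing", "foot", "##ball", "is", "fun"}
large_vocab = small_vocab | {"Playing", "football"}

print(tokenize("Playing football is fun", small_vocab))
# ['Play', '##ing', 'foot', '##ball', 'is', 'fun']
print(tokenize("Playing football is fun", large_vocab))
# ['Playing', 'football', 'is', 'fun']
```

With the larger vocabulary the whole words match first, so nothing is split; Option C is only correct when the full words are absent.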

4. Identify the error in this WordPiece tokenization output for the word 'unhappy': ["un", "happy"]

medium

A. Missing '##' prefix on 'happy' token

B. Incorrect splitting; 'unhappy' should be one token

C. Tokens should be reversed order

D. No error; this is correct tokenization

Solution

Step 1: Check token continuation rules
In WordPiece, every token after the first within the same word must start with '##' to show continuation.
Step 2: Analyze given tokens
'happy' is a continuation of 'un', so it should be '##happy', not 'happy'.
Final Answer:
Missing '##' prefix on 'happy' token -> Option A
Quick Check:
Continuation tokens need '##' prefix [OK]

Hint: Check if continuation tokens start with '##' [OK]

Common Mistakes:

Forgetting '##' on continuation tokens
Assuming all tokens are standalone
Thinking the token order is the problem here
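
One way to see why the missing '##' matters is to merge tokens back into words. In this minimal sketch (the helper name `merge_pieces` is hypothetical), a token without the prefix starts a new word, so ["un", "happy"] reads as two separate words rather than 'unhappy'.

```python
def merge_pieces(tokens):
    """Rebuild words by gluing '##' pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without the prefix
        else:
            words.append(token)     # no prefix: this token starts a new word
    return words


print(merge_pieces(["un", "happy"]))    # ['un', 'happy']
print(merge_pieces(["un", "##happy"]))  # ['unhappy']
```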

5. You want to tokenize the sentence "The unbreakable bond" using BERT's WordPiece tokenizer. Which tokenization output correctly handles the unknown word 'unbreakable'?

hard

A. ["The", "unbreakable", "bond"]

B. ["The", "un", "##break", "##able", "bond"]

C. ["The", "un", "breakable", "bond"]

D. ["The", "un", "##breakable", "bond"]

Solution

Step 1: Understand unknown word handling
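WordPiece breaks a word that is not in the vocabulary into smaller known subwords, marking each piece after the first with '##'. Only when no such split exists is the whole word replaced by the [UNK] token.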
Step 2: Analyze 'unbreakable'
It splits into 'un' + '##break' + '##able' to represent parts seen in vocabulary.
Step 3: Check other tokens
'The' and 'bond' are in the vocabulary as whole words, so they remain single tokens.
Final Answer:
["The", "un", "##break", "##able", "bond"] -> Option B
Quick Check:
Unknown words split into known subwords with ## [OK]

Hint: Split unknown words into known parts with ## prefix [OK]

Common Mistakes:

Treating unknown words as single tokens
Missing ## on continuation tokens
Splitting without ## prefix on continuation
