
BERT tokenization (WordPiece) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
How does WordPiece handle unknown words?

When BERT's WordPiece tokenizer encounters a word not in its vocabulary, what does it do?

A. It replaces the entire word with the special token [PAD].
B. It breaks the word into smaller known subword units until all parts are recognized, or uses [UNK] if no parts match.
C. It ignores the word and removes it from the input sequence.
D. It treats the whole word as a single token even if unknown.
💡 Hint

Think about how WordPiece tries to represent words using pieces it knows.
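The hint can be made concrete with a toy sketch of WordPiece's greedy longest-match-first loop. The vocabulary here is a tiny hypothetical set, not BERT's real ~30k-entry vocabulary:

```python
# Toy sketch of WordPiece's greedy longest-match-first algorithm.
# `vocab` is a hypothetical mini-vocabulary for illustration only.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest possible piece first; non-initial pieces
        # carry the "##" continuation prefix.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece fits: the whole word becomes [UNK].
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyzzy", vocab))      # ['[UNK]']
```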

Predict Output
intermediate
Output tokens from WordPiece tokenizer

Given the input word "unaffable", what is the output token list from BERT's WordPiece tokenizer?

Python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize('unaffable')
print(tokens)
A. ["un", "aff", "able"]
B. ["unaffable"]
C. ["una", "##ffa", "##ble"]
D. ["un", "##aff", "##able"]
💡 Hint

Look for subwords starting with ## indicating continuation pieces.
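The role of the ## prefix is easiest to see in reverse: stripping the marker and gluing each continuation piece onto the previous token reconstructs the word. This helper is a hypothetical sketch, not a transformers API:

```python
# Sketch: "##" marks a continuation of the previous piece, so
# detokenizing strips the marker and concatenates.
def join_wordpieces(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue continuation onto previous piece
        else:
            words.append(tok)
    return " ".join(words)

print(join_wordpieces(["un", "##aff", "##able"]))  # unaffable
```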

Model Choice
advanced
Choosing tokenizer for domain-specific text

You want to fine-tune BERT on medical text with many rare terms. Which tokenizer approach is best?

A. Use a whitespace tokenizer that splits only on spaces.
B. Use a character-level tokenizer that splits every character.
C. Train a new WordPiece tokenizer on the medical corpus to capture rare terms better.
D. Use the original BERT WordPiece tokenizer without changes.
💡 Hint

Think about how to handle many rare or new words effectively.
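Training a domain vocabulary can be sketched with the Hugging Face `tokenizers` library (assumed installed). The three-sentence corpus and small `vocab_size` below are hypothetical stand-ins for a real medical corpus and a realistic ~30k vocabulary:

```python
# Minimal sketch of training a domain-specific WordPiece vocabulary.
# Corpus and vocab_size are toy placeholders, not realistic values.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

corpus = [
    "patient presented with hyponatremia and tachycardia",
    "hyponatremia was corrected with hypertonic saline",
    "tachycardia resolved after beta blocker therapy",
]

tok = Tokenizer(WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=200,  # BERT uses ~30k; tiny here for the toy corpus
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tok.train_from_iterator(corpus, trainer)

# Frequent domain terms now tokenize into few pieces instead of many.
print(tok.encode("hyponatremia").tokens)
```

Because the vocabulary was learned from the domain text, rare medical terms that the stock bert-base-uncased vocabulary would shatter into many fragments get their own (sub)word entries.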

Metrics
advanced
Measuring tokenization efficiency

You compare two tokenizers on the same text. Tokenizer A produces 100 tokens; Tokenizer B produces 130 tokens. Which statement is true about tokenization efficiency?

A. Tokenizer A is more efficient because it uses fewer tokens to represent the text.
B. Tokenizer B is more efficient because it produces more tokens for detail.
C. Both are equally efficient because token count does not matter.
D. Tokenizer B is more efficient because more tokens mean better accuracy.
💡 Hint

Fewer tokens usually mean less computation and simpler input.
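One standard way to quantify this is "fertility": average tokens produced per word. The 80-word text length below is a hypothetical figure for illustration:

```python
# Sketch: comparing tokenizer fertility (tokens per word).
# Lower fertility means shorter sequences, so cheaper attention and
# more text fitting in a fixed-length context window.
def fertility(num_tokens, num_words):
    return num_tokens / num_words

words_in_text = 80                 # hypothetical word count of the text
a = fertility(100, words_in_text)  # Tokenizer A: 1.25 tokens/word
b = fertility(130, words_in_text)  # Tokenizer B: 1.625 tokens/word
print(a < b)  # True: A covers the same text with fewer tokens
```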

🔧 Debug
expert
Why does this WordPiece tokenization code raise an error?

Consider this code snippet:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(12345)
print(tokens)

What error does this code raise?

A. TypeError because the input to tokenize must be a string, not an integer.
B. ValueError because the number 12345 is out of vocabulary range.
C. KeyError because 12345 is not a token in the vocabulary.
D. No error; it tokenizes the number as a string.
💡 Hint

Check the input type expected by the tokenizer.
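A defensive wrapper makes the expectation explicit. Both `safe_tokenize` and the toy whitespace tokenizer below are hypothetical sketches standing in for `tokenizer.tokenize`:

```python
# Sketch: tokenizers expect `str` input, so validate before calling.
# `tokenize` is any callable taking a string, e.g. tokenizer.tokenize.
def safe_tokenize(tokenize, text):
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    return tokenize(text)

# Toy stand-in tokenizer for illustration: naive whitespace split.
toy = lambda s: s.split()

print(safe_tokenize(toy, "12345"))   # ['12345'] — pass numbers as strings
try:
    safe_tokenize(toy, 12345)        # raw int is rejected up front
except TypeError as e:
    print("raised:", e)
```

Converting numeric input with `str(12345)` before tokenizing is the usual fix for the snippet in the question.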