
BERT tokenization (WordPiece) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems


🧠 Conceptual

intermediate


How does WordPiece handle unknown words?

When BERT's WordPiece tokenizer encounters a word not in its vocabulary, what does it do?

A. It replaces the entire word with the special token [PAD].

B. It splits the word into smaller known subword units using greedy longest-match; if some part of the word cannot be matched, the whole word becomes [UNK].

C. It ignores the word and removes it from the input sequence.

D. It treats the whole word as a single token even if it is unknown.
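
The greedy longest-match splitting that WordPiece applies at inference time can be sketched in plain Python. This is a minimal illustration, not the `transformers` implementation: the function name `wordpiece` and the vocabulary `toy_vocab` are hypothetical, and the real tokenizer also applies normalization and a maximum word length that are omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word (toy sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Nothing in the vocabulary fits here: the whole word maps to unk.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"token", "##ize", "##r", "##s", "play", "##ing"}

print(wordpiece("tokenizers", toy_vocab))  # ['token', '##ize', '##r', '##s']
print(wordpiece("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece("playxyz", toy_vocab))     # ['[UNK]']
```

Note that `playxyz` starts with a known piece but still collapses to a single `[UNK]`, because the remainder cannot be matched.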


❓ Predict Output

intermediate


Output tokens from WordPiece tokenizer

Given the input word "unaffable", what token list does BERT's WordPiece tokenizer produce?


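from transformers import BertTokenizer

# Load the pretrained uncased BERT vocabulary (downloaded on first use).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# tokenize() returns the WordPiece strings, not the vocabulary ids.
tokens = tokenizer.tokenize('unaffable')
print(tokens)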

A. ["un", "aff", "able"]

B. ["unaffable"]

C. ["una", "##ffa", "##ble"]

D. ["un", "##aff", "##able"]
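
The `##` prefix in the options marks a piece that continues the previous token rather than starting a new word. A small sketch of how such a token list maps back to text, assuming the helper name `join_wordpieces` (hypothetical, not a `transformers` function):

```python
def join_wordpieces(tokens, prefix="##"):
    """Merge continuation pieces back into whitespace-separated words."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[len(prefix):]
        else:
            # A token without the prefix starts a new word.
            words.append(token)
    return " ".join(words)


print(join_wordpieces(["un", "##aff", "##able"]))  # unaffable
print(join_wordpieces(["un", "aff", "able"]))      # un aff able
```

A token list without continuation markers reads back as three separate words, so it cannot be a segmentation of one input word.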


❓ Model Choice

advanced


Choosing tokenizer for domain-specific text

You want to fine-tune BERT on medical text with many rare terms. Which tokenizer approach is best?

A. Use a whitespace tokenizer that splits only on spaces.

B. Use a character-level tokenizer that splits every character.

C. Train a new WordPiece tokenizer on the medical corpus to capture rare terms better.

D. Use the original BERT WordPiece tokenizer without changes.


❓ Metrics

advanced


Measuring tokenization efficiency

You compare two tokenizers on the same text. Tokenizer A produces 100 tokens; Tokenizer B produces 130 tokens. Which statement is true about tokenization efficiency?

A. Tokenizer A is more efficient because it uses fewer tokens to represent the text.

B. Tokenizer B is more efficient because it produces more tokens for detail.

C. Both are equally efficient because token count does not matter.

D. Tokenizer B is more efficient because more tokens mean better accuracy.
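
One common way to put a number on this comparison is tokens per word (sometimes called fertility). The sketch below uses the token counts from the question; the word count of 80 is a hypothetical value chosen only to make the arithmetic concrete, and `tokens_per_word` is an illustrative helper name.

```python
def tokens_per_word(token_count, word_count):
    """Average number of tokens a tokenizer spends per whitespace word."""
    return token_count / word_count


word_count = 80  # hypothetical: the question does not state the word count

fertility_a = tokens_per_word(100, word_count)
fertility_b = tokens_per_word(130, word_count)

print(fertility_a)  # 1.25
print(fertility_b)  # 1.625
print(f"B uses {130 / 100 - 1:.0%} more tokens than A")  # B uses 30% more tokens than A
```

Because the text is the same, the ratio between the two tokenizers does not depend on the assumed word count.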


🔧 Debug

expert


Why does this WordPiece tokenization code raise an error?

Consider this code snippet:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(12345)
print(tokens)

What error does this code raise?

A. TypeError because the input to tokenize must be a string, not an integer.

B. ValueError because the number 12345 is out of vocabulary range.

C. KeyError because 12345 is not a token in the vocabulary.

D. No error; it tokenizes the number as a string.
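
Independently of how the library reacts to a non-string argument, a caller can make the intent explicit by converting the value to text before tokenizing. A minimal sketch, assuming a hypothetical wrapper `tokenize_as_text`; `str.split` stands in for a real tokenizer's `tokenize` method so the example runs without downloading a model.

```python
def tokenize_as_text(tokenize, value):
    """Convert non-string input to str before handing it to a tokenizer."""
    if not isinstance(value, str):
        # Explicit conversion, so the tokenizer only ever sees text.
        value = str(value)
    return tokenize(value)


# str.split is a stand-in for something like tokenizer.tokenize.
print(tokenize_as_text(str.split, 12345))          # ['12345']
print(tokenize_as_text(str.split, "hello world"))  # ['hello', 'world']
```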


Practice


1. What is the main purpose of BERT's WordPiece tokenization?

easy

A. To split words into smaller known pieces for better handling of unknown words

B. To translate text into another language

C. To remove stop words from sentences

D. To convert text into numerical vectors directly


Solution

Step 1: Understand WordPiece tokenization

Step 2: Identify the purpose of this splitting

Final Answer:

Quick Check:

Solution

Step 1: Understand WordPiece token format

Step 2: Analyze the options

Final Answer:

Quick Check:

Solution

Step 1: Tokenize 'Playing'

Step 2: Tokenize 'football'

Step 3: Check remaining words

Final Answer:

Quick Check:

Solution

Step 1: Check token continuation rules

Step 2: Analyze given tokens

Final Answer:

Quick Check:

Solution

Step 1: Understand unknown word handling

Step 2: Analyze 'unbreakable'

Step 3: Check other tokens

Final Answer:

Quick Check: