
BERT tokenization (WordPiece) in NLP - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to import the BERT tokenizer from the transformers library.

from transformers import [1]

A. Tokenizer
B. AutoTokenizer
C. BertTokenizer
D. BertModel
Common Mistakes
Importing BertModel instead of BertTokenizer.
Using a generic Tokenizer class that does not exist.
Choosing AutoTokenizer: it exists and can also load a BERT tokenizer, but this exercise asks for the BERT-specific class, BertTokenizer.
2. Fill in the blank
medium

Complete the code to load the pretrained BERT tokenizer for 'bert-base-uncased'.

tokenizer = BertTokenizer.[1]('bert-base-uncased')

A. load
B. from_pretrained
C. init
D. tokenize
Common Mistakes
Using 'load' which is not a method of BertTokenizer.
Using 'init', which is not a method of BertTokenizer; the constructor (__init__) does not load a pretrained vocabulary by model name.
Using 'tokenize' which is for tokenizing text, not loading.
3. Fill in the blank
hard

Complete the code to split the sentence into token strings using the tokenizer.

tokens = tokenizer.[1]('Hello, how are you?')

A. tokenize
B. split
C. parse
D. encode
Common Mistakes
Using 'encode' which returns token IDs, not token strings.
Using 'split' which is a Python string method, not tokenizer method.
Using 'parse' which is not a tokenizer method.
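
The difference between token strings and token IDs can be sketched without the library. The example below is a minimal illustration, not the transformers implementation: the vocabulary, its ID values, and the helper names `tokens_to_ids` and `encode_with_specials` are all hypothetical. It assumes the sentence has already been split into tokens, and it wraps the sequence in [CLS] and [SEP] the way BERT-style encoding usually does.

```python
# Toy vocabulary. The IDs are hypothetical and not taken from any real BERT checkpoint.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hello": 4, ",": 5, "how": 6, "are": 7, "you": 8, "?": 9,
}


def tokens_to_ids(tokens, vocab=TOY_VOCAB):
    """Map token strings to integer IDs, falling back to [UNK] for unknown tokens."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


def encode_with_specials(tokens, vocab=TOY_VOCAB):
    """Wrap the token sequence in [CLS] ... [SEP] and convert it to IDs."""
    return tokens_to_ids(["[CLS]", *tokens, "[SEP]"], vocab)


# Token strings: what a tokenize-style call produces.
tokens = ["hello", ",", "how", "are", "you", "?"]
# Token IDs: what an encode-style call produces.
ids = encode_with_specials(tokens)

print(tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(ids)     # [2, 4, 5, 6, 7, 8, 9, 3]
```

The ID list is two entries longer than the token list because of the added [CLS] and [SEP] markers.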
4. Fill in the blank
hard

Fill both blanks to encode the input text into a dictionary containing token IDs and an attention mask, then read the token IDs from it.

encoded_input = tokenizer('[1]', return_tensors='pt', padding=True, truncation=True)
input_ids = encoded_input['[2]']

A. Hello, how are you?
B. input_ids
C. attention_mask
D. tokens
Common Mistakes
Using 'tokens' instead of 'input_ids' as dictionary key.
Putting a variable name instead of a string in the first blank.
Using 'attention_mask' key when token IDs are needed.
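
To see what the 'input_ids' and 'attention_mask' entries represent, here is a small sketch of padding a batch by hand. It uses plain Python lists instead of tensors, and the function name `pad_batch`, the padding ID 0, and the example IDs are assumptions made for illustration. With a single input sentence, as in the task above, there is nothing to pad and the mask is all ones.

```python
PAD_ID = 0  # hypothetical padding ID used only in this sketch


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_batch([[2, 4, 5, 3], [2, 6, 3]])
print(encoded["input_ids"])       # [[2, 4, 5, 3], [2, 6, 3, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```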
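5. Fill in the blank
hard

Fill all three blanks to decode the token IDs back into text without special tokens. With 'bert-base-uncased' the decoded text is lowercased, so it can differ in case from the original input.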

decoded_text = tokenizer.[1](encoded_input['[2]'][0], skip_special_tokens=[3])

A. decode
B. input_ids
C. True
D. False
Common Mistakes
Using 'encode' instead of 'decode'.
Using 'attention_mask' key instead of 'input_ids'.
Setting skip_special_tokens to False, which keeps special tokens.
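
Decoding can be pictured as the reverse lookup followed by gluing '##' pieces back onto the preceding token. The sketch below is a simplified stand-in for the library's decode step, with a hypothetical ID-to-token table and a hypothetical set of special tokens; the real method also cleans up spacing around punctuation, which is left out here.

```python
# Hypothetical ID-to-token table for illustration only.
ID_TO_TOKEN = {
    0: "[PAD]", 2: "[CLS]", 3: "[SEP]",
    4: "un", 5: "##break", 6: "##able", 7: "bond",
}
SPECIAL_TOKENS = {"[PAD]", "[CLS]", "[SEP]"}


def decode(ids, skip_special_tokens=False):
    """Turn IDs back into text, merging '##' continuation pieces into whole words."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # attach the piece to the previous word
        else:
            words.append(tok)
    return " ".join(words)


ids = [2, 4, 5, 6, 7, 3, 0]
print(decode(ids))                            # [CLS] unbreakable bond [SEP] [PAD]
print(decode(ids, skip_special_tokens=True))  # unbreakable bond
```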

Practice

1. What is the main purpose of BERT's WordPiece tokenization?
easy
A. To split words into smaller known pieces for better handling of unknown words
B. To translate text into another language
C. To remove stop words from sentences
D. To convert text into numerical vectors directly

Solution

  1. Step 1: Understand WordPiece tokenization

    WordPiece splits words that are missing from its vocabulary into smaller subword tokens that are in it; rare and unseen words are the ones most often split.
  2. Step 2: Identify the purpose of this splitting

    This splitting helps the model recognize parts of words it has seen before, improving understanding.
  3. Final Answer:

    To split words into smaller known pieces for better handling of unknown words -> Option A
  4. Quick Check:

    WordPiece = splitting unknown words [OK]
Hint: WordPiece breaks unknown words into known parts [OK]
Common Mistakes:
  • Thinking WordPiece translates text
  • Confusing tokenization with stop word removal
  • Assuming WordPiece directly converts text to numbers
2. Which of the following is the correct way to represent the word 'unaffable' using WordPiece tokens?
easy
A. ["un", "##affable"]
B. ["unaffable"]
C. ["un", "aff", "able"]
D. ["un", "##aff", "##able"]

Solution

  1. Step 1: Understand WordPiece token format

    WordPiece uses '##' to mark tokens that continue a word rather than start one.
  2. Step 2: Analyze the options

    ["un", "##aff", "##able"] correctly splits 'unaffable' into 'un' + '##aff' + '##able', showing continuation tokens.
  3. Final Answer:

    ["un", "##aff", "##able"] -> Option D
  4. Quick Check:

    Continuation tokens start with ## [OK]
Hint: Look for '##' prefix on continuation tokens [OK]
Common Mistakes:
  • Ignoring '##' prefix for continuation tokens
  • Treating whole word as one token always
  • Splitting tokens without '##' where needed
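
The splitting rule behind this answer can be sketched as a greedy longest-match-first search over a vocabulary. The code below is a minimal illustration with a hypothetical toy vocabulary; the function name `wordpiece` is made up for this example. Following the behaviour commonly described for BERT's tokenizer, a word with no matching piece at some position is replaced by [UNK] as a whole.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits here, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary. '##a' is included to show that the longer '##aff' wins.
TOY_VOCAB = {"un", "##aff", "##able", "##a"}

print(wordpiece("unaffable", TOY_VOCAB))  # ['un', '##aff', '##able']
print(wordpiece("xyz", TOY_VOCAB))        # ['[UNK]']
```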
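3. Given the sentence "Playing football is fun" and a small cased vocabulary that contains "Play", "##ing", "foot", "##ball", "is", and "fun" but not "Playing" or "football", which is the correct WordPiece tokenization output?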
medium
A. ["Play", "##ing", "football", "is", "fun"]
B. ["Playing", "football", "is", "fun"]
C. ["Play", "##ing", "foot", "##ball", "is", "fun"]
D. ["Play", "ing", "foot", "##ball", "is", "fun"]

Solution

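  1. Step 1: Tokenize 'Playing'

    WordPiece takes the longest vocabulary entry that matches from the start of the word. If 'Playing' is not in the vocabulary but 'Play' is, the word becomes 'Play' + '##ing'.
  2. Step 2: Tokenize 'football'

    By the same rule, 'football' becomes 'foot' + '##ball' when only those pieces are in the vocabulary. A larger vocabulary that contains 'Playing' or 'football' as whole entries would leave them unsplit.
  3. Step 3: Check remaining words

    'is' and 'fun' are vocabulary entries, so they remain single tokens.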
  4. Final Answer:

    ["Play", "##ing", "foot", "##ball", "is", "fun"] -> Option C
  5. Quick Check:

    Known roots + ## continuation tokens [OK]
Hint: Split known roots, add ## for continuations [OK]
Common Mistakes:
  • Not splitting compound words like football
  • Missing ## prefix on continuation tokens
  • Treating all words as single tokens
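
Whether a word is split depends entirely on the vocabulary. The sketch below runs the same greedy matcher over the sentence with two hypothetical vocabularies: one that only has the pieces, and one that also has the whole words. The names `SMALL_VOCAB`, `LARGE_VOCAB`, `wordpiece`, and `tokenize` are illustrative, and the sentence is split on whitespace only, which skips the punctuation handling and lowercasing a real BERT tokenizer may apply.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word; returns ['[UNK]'] if nothing fits."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]  # the loop ended without finding any matching piece
    return pieces


def tokenize(sentence, vocab):
    """Split on whitespace, then apply WordPiece to each word."""
    return [p for word in sentence.split() for p in wordpiece(word, vocab)]


SMALL_VOCAB = {"Play", "##ing", "foot", "##ball", "is", "fun"}
LARGE_VOCAB = SMALL_VOCAB | {"Playing", "football"}

print(tokenize("Playing football is fun", SMALL_VOCAB))
# ['Play', '##ing', 'foot', '##ball', 'is', 'fun']
print(tokenize("Playing football is fun", LARGE_VOCAB))
# ['Playing', 'football', 'is', 'fun']
```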
4. Identify the error in this WordPiece tokenization output for the word 'unhappy': ["un", "happy"]
medium
A. Missing '##' prefix on 'happy' token
B. Incorrect splitting; 'unhappy' should be one token
C. Tokens should be reversed order
D. No error; this is correct tokenization

Solution

  1. Step 1: Check token continuation rules

    In WordPiece, every piece of a word after the first must start with '##' to show that it continues the same word.
  2. Step 2: Analyze given tokens

    'happy' is a continuation of 'un', so it should be '##happy', not 'happy'.
  3. Final Answer:

    Missing '##' prefix on 'happy' token -> Option A
  4. Quick Check:

    Continuation tokens need '##' prefix [OK]
Hint: Check if continuation tokens start with '##' [OK]
Common Mistakes:
  • Forgetting '##' on continuation tokens
  • Assuming all tokens are standalone
  • Thinking the tokens are in the wrong order
5. You want to tokenize the sentence "The unbreakable bond" using a WordPiece tokenizer whose cased vocabulary contains "un", "##break", and "##able" but not "unbreakable". Which tokenization output correctly handles the out-of-vocabulary word 'unbreakable'?
hard
A. ["The", "unbreakable", "bond"]
B. ["The", "un", "##break", "##able", "bond"]
C. ["The", "un", "breakable", "bond"]
D. ["The", "un", "##breakable", "bond"]

Solution

  1. Step 1: Understand unknown word handling

    WordPiece breaks unknown words into smaller known subwords with '##' for continuation.
  2. Step 2: Analyze 'unbreakable'

    It splits into 'un' + '##break' + '##able' to represent parts seen in vocabulary.
  3. Step 3: Check other tokens

    'The' and 'bond' are common words and remain as single tokens.
  4. Final Answer:

    ["The", "un", "##break", "##able", "bond"] -> Option B
  5. Quick Check:

    Unknown words split into known subwords with ## [OK]
Hint: Split unknown words into known parts with ## prefix [OK]
Common Mistakes:
  • Treating unknown words as single tokens
  • Missing ## on continuation tokens
  • Splitting without ## prefix on continuation