BERT tokenization (WordPiece) in NLP
Introduction
BERT tokenization breaks text into smaller pieces called tokens, using the WordPiece algorithm. Words found in the vocabulary stay whole; other words are split into subword pieces, so the model can still represent rare or unseen words.
Syntax
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text = "hello world"  # any input string
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
tokenize(text) splits the input text into WordPiece tokens.
convert_tokens_to_ids(tokens) maps each token to its integer ID in the tokenizer's vocabulary, which is the form BERT takes as input.
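To make the token-to-ID step concrete, here is a minimal sketch of the lookup in plain Python. The vocabulary, its IDs, and the helper names tokens_to_ids and ids_to_tokens are made up for illustration; they are not the real 'bert-base-uncased' vocabulary or the library's implementation. The sketch assumes that a token missing from the vocabulary falls back to the [UNK] ID.

```python
# Toy vocabulary: token -> integer ID. These entries and IDs are invented
# for illustration; they are not the real bert-base-uncased vocabulary.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "play": 2, "##ing": 3, "hello": 4, "world": 5}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


def ids_to_tokens(ids, vocab):
    """Invert the lookup: map each ID back to its token string."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[i] for i in ids]


ids = tokens_to_ids(["play", "##ing", "zebra"], toy_vocab)
print(ids)                            # [2, 3, 1]
print(ids_to_tokens(ids, toy_vocab))  # ['play', '##ing', '[UNK]']
```

Note that the round trip is lossy for the unknown token: 'zebra' comes back as '[UNK]', because only the ID was kept.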
Examples
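These examples reuse the tokenizer loaded in the Syntax section.

    # A single common word
    text = "playing"
    tokens = tokenizer.tokenize(text)
    print(tokens)

    # A rare word, which WordPiece is expected to split into pieces
    text = "unaffable"
    tokens = tokenizer.tokenize(text)
    print(tokens)

    # Two words separated by whitespace
    text = "hello world"
    tokens = tokenizer.tokenize(text)
    print(tokens)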
Sample Model
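This code shows how to split text into WordPiece tokens, convert them to IDs, and decode the IDs back to text using the BERT tokenizer. Because 'bert-base-uncased' lowercases its input, the decoded text comes back in lowercase.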
    from transformers import BertTokenizer

    # Load BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Sample text
    text = "Playing with BERT tokenization is fun!"

    # Tokenize text
    tokens = tokenizer.tokenize(text)
    print("Tokens:", tokens)

    # Convert tokens to IDs
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print("Token IDs:", token_ids)

    # Decode back to text
    decoded_text = tokenizer.decode(token_ids)
    print("Decoded text:", decoded_text)
Important Notes
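A WordPiece token starting with '##' continues the previous token within the same word; it is not a standalone word.
The 'bert-base-uncased' tokenizer lowercases text before splitting it, so 'Playing' and 'playing' produce the same tokens.
Token IDs are integer indexes into the tokenizer's vocabulary; they are the input the BERT model actually receives.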
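The '##' convention is easier to follow next to the splitting rule itself. The sketch below is a simplified, plain-Python version of greedy longest-match-first WordPiece for a single word. The function name wordpiece and the toy vocabulary are assumptions for illustration, not the library's code or the real BERT vocabulary. Lowercasing is done by the caller to mimic an uncased model.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word by greedy longest-match-first lookup in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen so the words below split into pieces.
toy_vocab = {"un", "play", "##ing", "##aff", "##able", "##break"}

for word in ["Playing", "unaffable", "unbreakable", "xyz"]:
    print(word, "->", wordpiece(word.lower(), toy_vocab))
```

With this toy vocabulary, 'Playing' becomes ['play', '##ing'], while 'xyz' has no matching piece and becomes ['[UNK]'].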
Summary
BERT tokenization splits words that are not in its vocabulary into smaller pieces called WordPieces.
This helps handle unknown words by breaking them into known parts.
Use the tokenizer that matches your BERT model so the text is prepared the way the model expects.
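Preparing text for a BERT model also involves wrapping the tokens in the special [CLS] and [SEP] tokens and, for batches, padding to a common length. In the Hugging Face library, calling the tokenizer directly, as in tokenizer(text), adds these special tokens by default, whereas tokenize() alone does not. The plain-Python sketch below only illustrates the idea; the vocabulary, its IDs, and the build_inputs helper are hypothetical.

```python
# Toy vocabulary with invented IDs; not the real bert-base-uncased vocabulary.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "play": 4, "##ing": 5, "is": 6, "fun": 7}


def build_inputs(tokens, vocab, max_length):
    """Wrap tokens in [CLS] ... [SEP], map to IDs, and pad to max_length."""
    # Leave room for the two special tokens when truncating.
    wrapped = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in wrapped]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding        # 0 = padding, to be ignored
    return input_ids, attention_mask


input_ids, attention_mask = build_inputs(
    ["play", "##ing", "is", "fun"], toy_vocab, max_length=8
)
print(input_ids)       # [2, 4, 5, 6, 7, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```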
Practice
1. What is the main purpose of BERT's WordPiece tokenization?
easy
Solution
Step 1: Understand WordPiece tokenization
WordPiece breaks words into smaller parts called tokens, especially for unknown or rare words.
Step 2: Identify the purpose of this splitting
This splitting helps the model recognize parts of words it has seen before, improving understanding.
Final Answer:
To split words into smaller known pieces for better handling of unknown words -> Option A
Quick Check:
WordPiece = splitting unknown words [OK]
Hint: WordPiece breaks unknown words into known parts [OK]
Common Mistakes:
- Thinking WordPiece translates text
- Confusing tokenization with stop word removal
- Assuming WordPiece directly converts text to numbers
2. Which of the following is the correct way to represent the word 'unaffable' using WordPiece tokens?
easy
Solution
Step 1: Understand WordPiece token format
WordPiece uses '##' to mark tokens that continue from a previous token.
Step 2: Analyze the options
["un", "##aff", "##able"] correctly splits 'unaffable' into 'un' + '##aff' + '##able', showing continuation tokens.
Final Answer:
["un", "##aff", "##able"] -> Option D
Quick Check:
Continuation tokens start with ## [OK]
Hint: Look for '##' prefix on continuation tokens [OK]
Common Mistakes:
- Ignoring '##' prefix for continuation tokens
- Treating whole word as one token always
- Splitting tokens without '##' where needed
3. Given the sentence "Playing football is fun", which is the correct WordPiece tokenization output?
medium
Solution
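Step 1: Tokenize 'Playing'
Assuming the vocabulary contains 'Play' and '##ing' but not the whole word, WordPiece splits 'Playing' into 'Play' and '##ing'. It picks the longest vocabulary entry that matches at each position; it does not analyze word roots.
Step 2: Tokenize 'football'
Under the same assumption, it splits 'football' into 'foot' and '##ball'.
Step 3: Check remaining words
'is' and 'fun' are common words and remain as single tokens. With a real vocabulary such as 'bert-base-uncased', frequent words like 'playing' and 'football' are usually stored whole, and the text is lowercased first.
Final Answer:
["Play", "##ing", "foot", "##ball", "is", "fun"] -> Option C
Quick Check:
Longest known pieces + ## continuation tokens [OK]
Hint: Split into the longest known pieces, add ## for continuations [OK]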
Common Mistakes:
- Not splitting compound words like football
- Missing ## prefix on continuation tokens
- Treating all words as single tokens
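Whether a word such as 'football' is split depends entirely on the vocabulary. The sketch below runs the same greedy longest-match rule over a sentence with two hypothetical toy vocabularies to show the difference. The function name wordpiece_sentence and both vocabularies are assumptions for illustration. The sketch lowercases and splits on whitespace only, whereas the real BERT tokenizer also separates punctuation.

```python
def wordpiece_sentence(text, vocab, unk_token="[UNK]"):
    """Lowercase, split on whitespace, then greedily match each word."""
    tokens = []
    for word in text.lower().split():
        pieces, start = [], 0
        while start < len(word):
            # Longest candidate first; continuation pieces carry '##'.
            for end in range(len(word), start, -1):
                piece = ("##" if start else "") + word[start:end]
                if piece in vocab:
                    pieces.append(piece)
                    start = end
                    break
            else:
                pieces = [unk_token]  # nothing matched: mark the word unknown
                break
        tokens.extend(pieces)
    return tokens


# Two hypothetical vocabularies: one lacks the whole words, one has them.
small_vocab = {"play", "##ing", "foot", "##ball", "is", "fun"}
large_vocab = small_vocab | {"playing", "football"}

sentence = "Playing football is fun"
print(wordpiece_sentence(sentence, small_vocab))
# ['play', '##ing', 'foot', '##ball', 'is', 'fun']
print(wordpiece_sentence(sentence, large_vocab))
# ['playing', 'football', 'is', 'fun']
```

Because the longest match wins, adding the whole word to the vocabulary is enough to stop it from being split.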
4. Identify the error in this WordPiece tokenization output for the word 'unhappy': ["un", "happy"]
medium
Solution
Step 1: Check token continuation rules
In WordPiece, every token after the first within a word must start with '##' to show continuation.
Step 2: Analyze given tokens
'happy' is a continuation of 'un', so it should be '##happy', not 'happy'.
Final Answer:
Missing '##' prefix on 'happy' token -> Option A
Quick Check:
Continuation tokens need '##' prefix [OK]
Hint: Check if continuation tokens start with '##' [OK]
Common Mistakes:
- Forgetting '##' on continuation tokens
- Assuming all tokens are standalone
- Thinking order of tokens matters here
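One way to see why the missing '##' matters is to join the tokens back into text. In the minimal sketch below, join_wordpieces is a hypothetical helper written for illustration, not a library function. It glues '##' pieces onto the previous token and treats every other token as the start of a new word.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces back onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue onto the previous piece
        else:
            words.append(token)     # start a new word
    return " ".join(words)


print(join_wordpieces(["un", "##happy"]))  # unhappy
print(join_wordpieces(["un", "happy"]))    # un happy
```

Without the prefix, the two tokens read as the separate words 'un' and 'happy' rather than the single word 'unhappy'.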
5. You want to tokenize the sentence "The unbreakable bond" using BERT's WordPiece tokenizer. Which tokenization output correctly handles the unknown word 'unbreakable'?
hard
Solution
Step 1: Understand unknown word handling
WordPiece breaks unknown words into smaller known subwords with '##' for continuation.
Step 2: Analyze 'unbreakable'
It splits into 'un' + '##break' + '##able' to represent parts seen in vocabulary.
Step 3: Check other tokens
'The' and 'bond' are common words and remain as single tokens.
Final Answer:
["The", "un", "##break", "##able", "bond"] -> Option B
Quick Check:
Unknown words split into known subwords with ## [OK]
Hint: Split unknown words into known parts with ## prefix [OK]
Common Mistakes:
- Treating unknown words as single tokens
- Missing ## on continuation tokens
- Splitting without ## prefix on continuation
