
BERT tokenization (WordPiece) in NLP - Deep Dive

Overview - BERT tokenization (WordPiece)
What is it?
BERT tokenization using WordPiece is a method to split text into smaller pieces called tokens. These tokens can be whole words or parts of words. This helps BERT understand and process language better, especially for rare or new words. It breaks down text so the model can learn patterns from smaller, meaningful chunks.
Why it matters
Without WordPiece tokenization, BERT would struggle with words it has never seen before, making it hard to understand new or rare words. This would limit its ability to work well on real-world language, which is full of new terms, misspellings, or mixed languages. WordPiece helps BERT handle this variety smoothly, improving its accuracy and usefulness in many applications like search, translation, and chatbots.
Where it fits
Before learning BERT tokenization, you should understand basic text processing and why machines need to break text into tokens. After this, you can learn about BERT’s model architecture and how it uses these tokens to understand language. Later, you can explore other tokenization methods and compare their strengths.
Mental Model
Core Idea
WordPiece tokenization breaks words into smaller known pieces so BERT can understand any word by combining these pieces.
Think of it like...
It's like building words with LEGO blocks: even if you don't have a block for the whole word, you can build it by snapping together smaller blocks you already have.
Text input → [WordPiece Tokenizer] → Tokens (whole words or subwords)

Example:
"unhappiness" → [un, ##happi, ##ness]

┌───────────────┐
│   Input Text  │
└──────┬────────┘
       │
       ▼
┌─────────────────────┐
│ WordPiece Tokenizer │
└──────┬──────────────┘
       │
       ▼
┌───────────────┬───────────────┬───────────────┐
│     un        │   ##happi     │   ##ness      │
└───────────────┴───────────────┴───────────────┘
Build-Up - 7 Steps
1
Foundation: What is Tokenization in NLP
🤔
Concept: Tokenization means splitting text into smaller pieces called tokens.
When computers read text, they can't understand sentences directly. So, we split sentences into words or smaller parts called tokens. For example, 'I love cats' becomes ['I', 'love', 'cats']. This is the first step in processing language.
Result
Text is split into tokens that a computer can work with.
Understanding tokenization is essential because all language models start by breaking text into manageable pieces.
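A minimal sketch of this first step in Python, splitting a sentence on whitespace (real pipelines also handle punctuation and casing):

```python
# The simplest form of tokenization: split on whitespace.
sentence = "I love cats"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'cats']
```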
2
Foundation: Why Simple Word Tokenization Fails
🤔
Concept: Splitting text only by spaces misses parts of words and unknown words.
If we split only by spaces, words like 'unhappiness' stay whole. But if the model never saw 'unhappiness' before, it can't understand it. Also, new words or typos won't match known words, causing problems.
Result
Simple tokenization leads to many unknown words and poor understanding.
Knowing the limits of simple tokenization shows why more advanced methods like WordPiece are needed.
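A small sketch of the problem, assuming a toy whole-word vocabulary (the words in it are made up for illustration): any word outside the vocabulary collapses into an unknown token, and the model loses all information about it.

```python
# With space-only tokenization and a fixed whole-word vocabulary,
# every out-of-vocabulary word becomes an uninformative [UNK] token.
vocab = {"i", "love", "cats", "happy"}

def word_tokenize(text):
    return [w if w in vocab else "[UNK]" for w in text.lower().split()]

print(word_tokenize("I love cats"))        # ['i', 'love', 'cats']
print(word_tokenize("unhappiness ruins"))  # ['[UNK]', '[UNK]']
```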
3
Intermediate: How WordPiece Tokenization Works
🤔
Concept: WordPiece breaks words into smaller known subwords using a vocabulary built from training data.
WordPiece starts with a vocabulary of common words and subwords. When it sees a new word, it tries to split it into the longest known pieces. For example, 'unhappiness' becomes 'un', '##happi', '##ness'. The '##' means the piece is attached to the previous one.
Result
Words are split into smaller parts that the model knows, reducing unknown tokens.
Understanding WordPiece's splitting helps explain how BERT can handle new or rare words gracefully.
4
Intermediate: Building the WordPiece Vocabulary
🤔 Before reading on: do you think the WordPiece vocabulary contains only whole words, or also parts of words? Commit to your answer.
Concept: The vocabulary is created by finding the most frequent subword pieces in the training text.
WordPiece vocabulary is built by starting with all individual characters and then merging pairs of symbols step by step. At each step it merges the pair that most improves the likelihood of the training data, which favors frequently co-occurring pieces. This process continues until the vocabulary reaches a set size. This way, common words stay whole, and rare words are split into meaningful parts.
Result
A vocabulary that balances whole words and subwords, enabling flexible tokenization.
Knowing how the vocabulary is built reveals why WordPiece can represent many words efficiently with a limited set of tokens.
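The merge loop can be sketched in Python. One assumption to note: real WordPiece scores candidate merges by likelihood improvement, while this deliberately simplified version merges the most frequent adjacent pair (closer to BPE) to keep the code short; the corpus and merge count are made up.

```python
from collections import Counter

# Toy corpus; real vocabularies are trained on billions of words.
corpus = ["low", "lower", "lowest", "newest", "widest"]

def build_vocab(words, num_merges):
    # Start from single characters; each word is a list of symbols.
    splits = [list(w) for w in words]
    vocab = {c for w in splits for c in w}
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in splits:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        vocab.add(a + b)
        # Replace the chosen pair everywhere it occurs.
        for w in splits:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return vocab

vocab = build_vocab(corpus, num_merges=5)
print(sorted(vocab))  # single characters plus merged subwords like 'low', 'est'
```

After a few merges, common fragments such as 'low' and 'est' join the character inventory, which is exactly how frequent words end up staying whole.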
5
Intermediate: Tokenizing Text with WordPiece Step-by-Step
🤔 Before reading on: when tokenizing 'playing', do you think WordPiece will split it into 'play' + '##ing' or keep it whole? Commit to your answer.
Concept: WordPiece tokenizes by greedily matching the longest known subwords from left to right.
To tokenize 'playing', WordPiece looks for the longest prefix in the vocabulary. It finds 'play' then looks at the rest 'ing' and finds '##ing'. So, 'playing' becomes ['play', '##ing']. This greedy approach ensures tokens are as large as possible.
Result
Text is split into meaningful subwords that the model can understand.
Understanding the greedy matching process clarifies how tokenization balances between whole words and subwords.
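The greedy longest-match procedure can be sketched in a few lines of Python. The toy vocabulary below is an assumption for illustration; a real BERT vocabulary holds roughly 30,000 entries loaded from a vocab file.

```python
# Greedy longest-match-first WordPiece tokenization (sketch).
VOCAB = {"play", "##ing", "##ed", "un", "##happi", "##ness"}

def wordpiece_tokenize(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until a vocabulary entry matches.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # mark continuation pieces
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:               # nothing matched at this position
            return ["[UNK]"]
        start = end
    return tokens

print(wordpiece_tokenize("playing"))      # ['play', '##ing']
print(wordpiece_tokenize("unhappiness"))  # ['un', '##happi', '##ness']
```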
6
Advanced: Handling Unknown and Rare Words
🤔 Before reading on: do you think WordPiece can always split any word into known tokens? Commit to your answer.
Concept: WordPiece can break a word down to single characters if needed, so unknown tokens become rare.
If a word is very rare or new, WordPiece breaks it down into smaller and smaller pieces until it reaches single characters. As long as each character is in the vocabulary, every word can be represented; a word containing a character outside the vocabulary is mapped to the special [UNK] token.
Result
Unknown tokens become rare, improving model robustness on new text.
Knowing WordPiece's fallback to characters explains why BERT can handle almost any input text gracefully.
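A compact version of the same greedy matcher shows the fallback. The toy vocabulary here is an assumption: it contains only single characters, so any word built from them decomposes character by character, while a word containing an unknown character yields the special [UNK] token.

```python
# Character-level fallback in WordPiece tokenization (sketch).
VOCAB = {"x", "y", "##x", "##y", "##z"}

def tokenize_with_fallback(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:            # not even this single character is known
            return ["[UNK]"]
        start = end
    return tokens

print(tokenize_with_fallback("xyz"))  # ['x', '##y', '##z']
print(tokenize_with_fallback("q"))    # ['[UNK]']
```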
7
Expert: Impact of WordPiece on BERT’s Performance
🤔 Before reading on: does WordPiece tokenization improve or reduce BERT’s understanding of language? Commit to your answer.
Concept: WordPiece tokenization balances vocabulary size and coverage, affecting BERT’s accuracy and efficiency.
A smaller vocabulary means fewer tokens but more splitting, which can slow training and reduce context. A larger vocabulary means more whole words but needs more memory. WordPiece finds a middle ground, improving BERT’s ability to learn language patterns while keeping the model efficient.
Result
BERT achieves strong language understanding with manageable model size and training time.
Understanding this tradeoff helps experts tune tokenization for better model performance in real applications.
Under the Hood
WordPiece tokenization uses a greedy longest-match-first algorithm to split input text into subword tokens from a fixed vocabulary. The vocabulary is built by iteratively merging frequent character pairs into subwords during training. At runtime, the tokenizer scans the input left to right, matching the longest possible subword in the vocabulary. If no match is found, it falls back to single characters; a character that is itself absent from the vocabulary is mapped to the special [UNK] token. Each subword token is mapped to an embedding vector that BERT uses as input. This process allows BERT to represent almost any word as a sequence of known subwords, so unknown tokens are rare even for unseen words.
Why designed this way?
WordPiece was designed to solve the problem of unknown words in language models while keeping vocabulary size manageable. Earlier methods either used full words, leading to huge vocabularies and many unknowns, or single characters, losing semantic meaning. WordPiece balances these by using subwords, capturing meaningful parts of words and reducing unknowns. The greedy longest-match algorithm is simple and efficient, making tokenization fast. Byte-pair encoding (BPE) is a similar alternative but differs in how merges are chosen: BPE merges the most frequent pair, while WordPiece merges the pair that most improves the likelihood of the training data. WordPiece became popular with BERT because it fits well with transformer models and their input embedding layers.
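The final step, mapping tokens to embedding indices, is a plain dictionary lookup. The IDs for [CLS] and [SEP] (101, 102) match bert-base-uncased; the subword IDs here are made up for illustration, since real IDs come from the released vocab file.

```python
# Each token maps to an integer ID that indexes one row of BERT's
# token-embedding matrix. Subword IDs below are illustrative.
token_to_id = {"[CLS]": 101, "[SEP]": 102, "play": 2001, "##ing": 2002}

def convert_tokens_to_ids(tokens):
    return [token_to_id[t] for t in tokens]

print(convert_tokens_to_ids(["[CLS]", "play", "##ing", "[SEP]"]))
# [101, 2001, 2002, 102]
```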
Input Text
   │
   ▼
┌─────────────────────┐
│ WordPiece Tokenizer │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────────────┐
│ Greedy Longest Match Search │
└─────────┬───────────────────┘
          │
          ▼
┌───────────────┬───────────────┐
│ Known Subword │ Unknown?      │
│ Found?        │               │
└──────┬────────┴──────┬────────┘
       │               │
       ▼               ▼
┌─────────────┐   ┌─────────────┐
│ Emit Token  │   │ Break to    │
│             │   │ Characters  │
└─────────────┘   └─────────────┘
          │               │
          └───────┬───────┘
                  ▼
           Token Sequence
                  │
                  ▼
           BERT Embeddings
Myth Busters - 4 Common Misconceptions
Quick: Does WordPiece always split words into single characters? Commit to yes or no.
Common Belief: WordPiece always splits words down to single characters.
Reality: WordPiece tries to split words into the longest known subwords first and only breaks down to single characters if no larger subwords match.
Why it matters: Believing it always splits to characters leads to underestimating the semantic information preserved in tokens, which affects how you interpret model inputs.
Quick: Is WordPiece vocabulary just a list of whole words? Commit to yes or no.
Common Belief: WordPiece vocabulary contains only whole words.
Reality: WordPiece vocabulary contains both whole words and subword pieces, including prefixes, suffixes, and common fragments.
Why it matters: Thinking the vocabulary is only whole words causes confusion about how rare or new words are handled and why token counts vary.
Quick: Does WordPiece tokenization depend on language grammar rules? Commit to yes or no.
Common Belief: WordPiece tokenization uses grammar rules to split words.
Reality: WordPiece tokenization is purely statistical, based on the frequency of subword pieces in training data, not grammar or meaning.
Why it matters: Assuming grammar rules are used can mislead learners about tokenization errors and limit understanding of its flexibility across languages.
Quick: Does a larger WordPiece vocabulary always improve model accuracy? Commit to yes or no.
Common Belief: Increasing vocabulary size always makes BERT more accurate.
Reality: A larger vocabulary can improve accuracy but also increases model size and training complexity; there's a tradeoff that WordPiece balances.
Why it matters: Ignoring this tradeoff can lead to inefficient models that are too large or slow without meaningful accuracy gains.
Expert Zone
1
WordPiece tokenization can affect downstream tasks differently; for example, splitting named entities may reduce interpretability but improve generalization.
2
The '##' prefix in tokens is a convention indicating the token is a continuation, which helps BERT learn word boundaries internally.
3
WordPiece vocabulary is fixed after training, so adapting to new domains requires careful vocabulary extension or retraining.
When NOT to use
WordPiece is less effective for languages with very different morphology or scripts, such as agglutinative languages or those without clear word boundaries; alternatives like SentencePiece or character-level tokenization may be better.
Production Patterns
In production, WordPiece tokenization is often combined with caching tokenized inputs for efficiency, and vocabulary is sometimes customized for domain-specific language to improve accuracy.
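One way to sketch that caching pattern, assuming a per-word tokenizer function (the vocabulary and cache size here are illustrative):

```python
from functools import lru_cache

VOCAB = {"play", "##ing", "the", "game"}

# Memoize per-word tokenization so repeated words across requests are
# only split once; results are tuples because cached values must be
# hashable and immutable.
@lru_cache(maxsize=100_000)
def cached_tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                tokens.append(piece)
                break
            end -= 1
        if end == start:
            return ("[UNK]",)
        start = end
    return tuple(tokens)

for w in ["playing", "the", "playing", "the", "game"]:
    cached_tokenize(w)
print(cached_tokenize.cache_info())  # repeated words register as cache hits
```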
Connections
Byte Pair Encoding (BPE)
Similar subword tokenization method with different vocabulary building rules.
Understanding WordPiece helps grasp BPE since both break words into subwords to handle unknown words, but differ in how they merge subwords.
Morse Code
Both encode complex information into smaller, reusable units for efficient communication.
Knowing how WordPiece breaks words into subwords is like how Morse code breaks letters into dots and dashes, enabling flexible and compact representation.
Data Compression Algorithms
WordPiece vocabulary building resembles compression by merging frequent patterns to reduce size.
Seeing WordPiece as a compression method clarifies why it balances vocabulary size and coverage to efficiently represent language.
Common Pitfalls
#1 Treating WordPiece tokens as independent words in analysis.
Wrong approach:
tokens = ['un', '##happi', '##ness']
print('Number of words:', len(tokens))  # Outputs 3
Correct approach:
tokens = ['un', '##happi', '##ness']
# Tokens starting with '##' continue the previous word
words = sum(1 for t in tokens if not t.startswith('##'))
print('Number of words:', words)  # Outputs 1
Root cause:Misunderstanding that WordPiece tokens can be subword parts of a single word, not separate words.
#2 Using a WordPiece vocabulary built for one language on text in another language.
Wrong approach:
tokenizer = WordPieceTokenizer(vocab='english_vocab.txt')
tokens = tokenizer.tokenize('こんにちは')  # Japanese text
Correct approach:
tokenizer = WordPieceTokenizer(vocab='japanese_vocab.txt')
tokens = tokenizer.tokenize('こんにちは')
Root cause:Assuming one vocabulary fits all languages ignores language-specific subword patterns.
#3 Ignoring the '##' prefix and treating all tokens as standalone.
Wrong approach:
tokens = ['play', 'ing']  # Missing '##' prefix
print('Tokens:', tokens)
Correct approach:
tokens = ['play', '##ing']  # Correct WordPiece tokens
print('Tokens:', tokens)
Root cause:Not recognizing the '##' prefix indicates token continuation, which is important for reconstructing words.
Key Takeaways
WordPiece tokenization breaks words into smaller known pieces to help BERT understand any word, including rare or new ones.
It uses a fixed vocabulary built from frequent subword patterns, balancing vocabulary size and coverage.
The tokenizer greedily matches the longest subwords first, falling back to characters only if necessary.
This method makes unknown tokens rare, improving BERT’s robustness and accuracy on diverse text.
Understanding WordPiece’s design and tradeoffs is key to effectively using and tuning BERT models.