
BERT tokenization (WordPiece) in NLP - Deep Dive

Overview - BERT tokenization (WordPiece)
What is it?
BERT tokenization using WordPiece is a method to split text into smaller pieces called tokens. These tokens can be whole words or parts of words. This helps BERT understand and process language better, especially for rare or new words. It breaks down text so the model can learn patterns from smaller, meaningful chunks.
Why it matters
Without WordPiece tokenization, BERT would struggle with words it has never seen before, making it hard to understand new or rare words. This would limit its ability to work well on real-world language, which is full of new terms, misspellings, or mixed languages. WordPiece helps BERT handle this variety smoothly, improving its accuracy and usefulness in many applications like search, translation, and chatbots.
Where it fits
Before learning BERT tokenization, you should understand basic text processing and why machines need to break text into tokens. After this, you can learn about BERT’s model architecture and how it uses these tokens to understand language. Later, you can explore other tokenization methods and compare their strengths.
Mental Model
Core Idea
WordPiece tokenization breaks words into smaller known pieces so BERT can understand any word by combining these pieces.
Think of it like...
It's like building words with LEGO blocks: even if you don't have a block for the whole word, you can build it by snapping together smaller blocks you already have.
Text input → [WordPiece Tokenizer] → Tokens (whole words or subwords)

Example:
"unhappiness" → [un, ##happi, ##ness]

┌───────────────┐
│   Input Text  │
└──────┬────────┘
       │
       ▼
┌─────────────────────┐
│ WordPiece Tokenizer │
└──────┬──────────────┘
       │
       ▼
┌───────────────┬───────────────┬───────────────┐
│     un        │   ##happi     │   ##ness      │
└───────────────┴───────────────┴───────────────┘
Build-Up - 7 Steps
1
Foundation: What is Tokenization in NLP
🤔
Concept: Tokenization means splitting text into smaller pieces called tokens.
When computers read text, they can't understand sentences directly. So, we split sentences into words or smaller parts called tokens. For example, 'I love cats' becomes ['I', 'love', 'cats']. This is the first step in processing language.
Result
Text is split into tokens that a computer can work with.
Understanding tokenization is essential because all language models start by breaking text into manageable pieces.
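A minimal sketch of this first step in Python, splitting a sentence on whitespace (real pipelines also handle punctuation and casing):

```python
# The simplest form of tokenization: split on whitespace.
sentence = "I love cats"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'cats']
```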
2
Foundation: Why Simple Word Tokenization Fails
🤔
Concept: Splitting text only by spaces misses parts of words and unknown words.
If we split only by spaces, words like 'unhappiness' stay whole. But if the model never saw 'unhappiness' before, it can't understand it. Also, new words or typos won't match known words, causing problems.
Result
Simple tokenization leads to many unknown words and poor understanding.
Knowing the limits of simple tokenization shows why more advanced methods like WordPiece are needed.
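A small sketch of the problem, assuming a toy whole-word vocabulary (the words in it are made up for illustration): any word outside the vocabulary collapses into an unknown token, and the model loses all information about it.

```python
# With space-only tokenization and a fixed whole-word vocabulary,
# every out-of-vocabulary word becomes an uninformative [UNK] token.
vocab = {"i", "love", "cats", "happy"}

def word_tokenize(text):
    return [w if w in vocab else "[UNK]" for w in text.lower().split()]

print(word_tokenize("I love cats"))        # ['i', 'love', 'cats']
print(word_tokenize("unhappiness ruins"))  # ['[UNK]', '[UNK]']
```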
3
Intermediate: How WordPiece Tokenization Works
🤔
Concept: WordPiece breaks words into smaller known subwords using a vocabulary built from training data.
WordPiece starts with a vocabulary of common words and subwords. When it sees a new word, it tries to split it into the longest known pieces. For example, 'unhappiness' becomes 'un', '##happi', '##ness'. The '##' means the piece is attached to the previous one.
Result
Words are split into smaller parts that the model knows, reducing unknown tokens.
Understanding WordPiece's splitting helps explain how BERT can handle new or rare words gracefully.
4
Intermediate: Building the WordPiece Vocabulary
🤔 Before reading on: do you think the WordPiece vocabulary contains only whole words, or also parts of words? Commit to your answer.
Concept: The vocabulary is created by finding the most frequent subword pieces in the training text.
WordPiece vocabulary is built by starting with all individual characters and then merging pairs of symbols step by step. At each step it merges the pair that most improves the likelihood of the training data, which favors frequently co-occurring pieces. This process continues until the vocabulary reaches a set size. This way, common words stay whole, and rare words are split into meaningful parts.
Result
A vocabulary that balances whole words and subwords, enabling flexible tokenization.
Knowing how the vocabulary is built reveals why WordPiece can represent many words efficiently with a limited set of tokens.
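The merge loop can be sketched in Python. One assumption to note: real WordPiece scores candidate merges by likelihood improvement, while this deliberately simplified version merges the most frequent adjacent pair (closer to BPE) to keep the code short; the corpus and merge count are made up.

```python
from collections import Counter

# Toy corpus; real vocabularies are trained on billions of words.
corpus = ["low", "lower", "lowest", "newest", "widest"]

def build_vocab(words, num_merges):
    # Start from single characters; each word is a list of symbols.
    splits = [list(w) for w in words]
    vocab = {c for w in splits for c in w}
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in splits:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        vocab.add(a + b)
        # Replace the chosen pair everywhere it occurs.
        for w in splits:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return vocab

vocab = build_vocab(corpus, num_merges=5)
print(sorted(vocab))  # single characters plus merged subwords like 'low', 'est'
```

After a few merges, common fragments such as 'low' and 'est' join the character inventory, which is exactly how frequent words end up staying whole.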
5
Intermediate: Tokenizing Text with WordPiece Step-by-Step
🤔 Before reading on: when tokenizing 'playing', do you think WordPiece will split it into 'play' + '##ing' or keep it whole? Commit to your answer.
Concept: WordPiece tokenizes by greedily matching the longest known subwords from left to right.
To tokenize 'playing', WordPiece looks for the longest prefix in the vocabulary. It finds 'play' then looks at the rest 'ing' and finds '##ing'. So, 'playing' becomes ['play', '##ing']. This greedy approach ensures tokens are as large as possible.
Result
Text is split into meaningful subwords that the model can understand.
Understanding the greedy matching process clarifies how tokenization balances between whole words and subwords.
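The greedy longest-match procedure can be sketched in a few lines of Python. The toy vocabulary below is an assumption for illustration; a real BERT vocabulary holds roughly 30,000 entries loaded from a vocab file.

```python
# Greedy longest-match-first WordPiece tokenization (sketch).
VOCAB = {"play", "##ing", "##ed", "un", "##happi", "##ness"}

def wordpiece_tokenize(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until a vocabulary entry matches.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # mark continuation pieces
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:               # nothing matched at this position
            return ["[UNK]"]
        start = end
    return tokens

print(wordpiece_tokenize("playing"))      # ['play', '##ing']
print(wordpiece_tokenize("unhappiness"))  # ['un', '##happi', '##ness']
```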
6
Advanced: Handling Unknown and Rare Words
🤔 Before reading on: do you think WordPiece can always split any word into known tokens? Commit to your answer.
Concept: WordPiece can break a word down to single characters if needed, so unknown tokens become rare.
If a word is very rare or new, WordPiece breaks it down into smaller and smaller pieces until it reaches single characters. As long as each character is in the vocabulary, every word can be represented; a word containing a character outside the vocabulary is mapped to the special [UNK] token.
Result
Unknown tokens become rare, improving model robustness on new text.
Knowing WordPiece's fallback to characters explains why BERT can handle almost any input text gracefully.
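A compact version of the same greedy matcher shows the fallback. The toy vocabulary here is an assumption: it contains only single characters, so any word built from them decomposes character by character, while a word containing an unknown character yields the special [UNK] token.

```python
# Character-level fallback in WordPiece tokenization (sketch).
VOCAB = {"x", "y", "##x", "##y", "##z"}

def tokenize_with_fallback(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:            # not even this single character is known
            return ["[UNK]"]
        start = end
    return tokens

print(tokenize_with_fallback("xyz"))  # ['x', '##y', '##z']
print(tokenize_with_fallback("q"))    # ['[UNK]']
```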
7
Expert: Impact of WordPiece on BERT’s Performance
🤔 Before reading on: does WordPiece tokenization improve or reduce BERT’s understanding of language? Commit to your answer.
Concept: WordPiece tokenization balances vocabulary size and coverage, affecting BERT’s accuracy and efficiency.
A smaller vocabulary means fewer tokens but more splitting, which can slow training and reduce context. A larger vocabulary means more whole words but needs more memory. WordPiece finds a middle ground, improving BERT’s ability to learn language patterns while keeping the model efficient.
Result
BERT achieves strong language understanding with manageable model size and training time.
Understanding this tradeoff helps experts tune tokenization for better model performance in real applications.
Under the Hood
WordPiece tokenization uses a greedy longest-match-first algorithm to split input text into subword tokens from a fixed vocabulary. The vocabulary is built by iteratively merging frequent character pairs into subwords during training. At runtime, the tokenizer scans the input left to right, matching the longest possible subword in the vocabulary. If no match is found, it falls back to single characters; a character that is itself absent from the vocabulary is mapped to the special [UNK] token. Each subword token is mapped to an embedding vector that BERT uses as input. This process allows BERT to represent almost any word as a sequence of known subwords, so unknown tokens are rare even for unseen words.
Why designed this way?
WordPiece was designed to solve the problem of unknown words in language models while keeping vocabulary size manageable. Earlier methods either used full words, leading to huge vocabularies and many unknowns, or single characters, losing semantic meaning. WordPiece balances these by using subwords, capturing meaningful parts of words and reducing unknowns. The greedy longest-match algorithm is simple and efficient, making tokenization fast. Byte-pair encoding (BPE) is a similar alternative but differs in how merges are chosen: BPE merges the most frequent pair, while WordPiece merges the pair that most improves the likelihood of the training data. WordPiece became popular with BERT because it fits well with transformer models and their input embedding layers.
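The final step, mapping tokens to embedding indices, is a plain dictionary lookup. The IDs for [CLS] and [SEP] (101, 102) match bert-base-uncased; the subword IDs here are made up for illustration, since real IDs come from the released vocab file.

```python
# Each token maps to an integer ID that indexes one row of BERT's
# token-embedding matrix. Subword IDs below are illustrative.
token_to_id = {"[CLS]": 101, "[SEP]": 102, "play": 2001, "##ing": 2002}

def convert_tokens_to_ids(tokens):
    return [token_to_id[t] for t in tokens]

print(convert_tokens_to_ids(["[CLS]", "play", "##ing", "[SEP]"]))
# [101, 2001, 2002, 102]
```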
Input Text
   │
   ▼
┌─────────────────────┐
│ WordPiece Tokenizer │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────────────┐
│ Greedy Longest Match Search │
└─────────┬───────────────────┘
          │
          ▼
┌───────────────┬───────────────┐
│ Known Subword │ Unknown?      │
│ Found?        │               │
└──────┬────────┴──────┬────────┘
       │               │
       ▼               ▼
┌─────────────┐   ┌─────────────┐
│ Emit Token  │   │ Break to    │
│             │   │ Characters  │
└─────────────┘   └─────────────┘
          │               │
          └───────┬───────┘
                  ▼
           Token Sequence
                  │
                  ▼
           BERT Embeddings
Myth Busters - 4 Common Misconceptions
Quick: Does WordPiece always split words into single characters? Commit to yes or no.
Common Belief: WordPiece always splits words down to single characters.
Reality: WordPiece tries to split words into the longest known subwords first and only breaks down to single characters if no larger subwords match.
Why it matters: Believing it always splits to characters leads to underestimating the semantic information preserved in tokens, which affects how you interpret model inputs.
Quick: Is WordPiece vocabulary just a list of whole words? Commit to yes or no.
Common Belief: WordPiece vocabulary contains only whole words.
Reality: WordPiece vocabulary contains both whole words and subword pieces, including prefixes, suffixes, and common fragments.
Why it matters: Thinking the vocabulary is only whole words causes confusion about how rare or new words are handled and why token counts vary.
Quick: Does WordPiece tokenization depend on language grammar rules? Commit to yes or no.
Common Belief: WordPiece tokenization uses grammar rules to split words.
Reality: WordPiece tokenization is purely statistical, based on the frequency of subword pieces in training data, not grammar or meaning.
Why it matters: Assuming grammar rules are used can mislead learners about tokenization errors and limit understanding of its flexibility across languages.
Quick: Does a larger WordPiece vocabulary always improve model accuracy? Commit to yes or no.
Common Belief: Increasing vocabulary size always makes BERT more accurate.
Reality: A larger vocabulary can improve accuracy but also increases model size and training complexity; there's a tradeoff that WordPiece balances.
Why it matters: Ignoring this tradeoff can lead to inefficient models that are too large or slow without meaningful accuracy gains.
Expert Zone
1
WordPiece tokenization can affect downstream tasks differently; for example, splitting named entities may reduce interpretability but improve generalization.
2
The '##' prefix in tokens is a convention indicating the token is a continuation, which helps BERT learn word boundaries internally.
3
WordPiece vocabulary is fixed after training, so adapting to new domains requires careful vocabulary extension or retraining.
When NOT to use
WordPiece is less effective for languages with very different morphology or scripts, such as agglutinative languages or those without clear word boundaries; alternatives like SentencePiece or character-level tokenization may be better.
Production Patterns
In production, WordPiece tokenization is often combined with caching tokenized inputs for efficiency, and vocabulary is sometimes customized for domain-specific language to improve accuracy.
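One way to sketch that caching pattern, assuming a per-word tokenizer function (the vocabulary and cache size here are illustrative):

```python
from functools import lru_cache

VOCAB = {"play", "##ing", "the", "game"}

# Memoize per-word tokenization so repeated words across requests are
# only split once; results are tuples because cached values must be
# hashable and immutable.
@lru_cache(maxsize=100_000)
def cached_tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                tokens.append(piece)
                break
            end -= 1
        if end == start:
            return ("[UNK]",)
        start = end
    return tuple(tokens)

for w in ["playing", "the", "playing", "the", "game"]:
    cached_tokenize(w)
print(cached_tokenize.cache_info())  # repeated words register as cache hits
```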
Connections
Byte Pair Encoding (BPE)
Similar subword tokenization method with different vocabulary building rules.
Understanding WordPiece helps grasp BPE since both break words into subwords to handle unknown words, but differ in how they merge subwords.
Morse Code
Both encode complex information into smaller, reusable units for efficient communication.
Knowing how WordPiece breaks words into subwords is like how Morse code breaks letters into dots and dashes, enabling flexible and compact representation.
Data Compression Algorithms
WordPiece vocabulary building resembles compression by merging frequent patterns to reduce size.
Seeing WordPiece as a compression method clarifies why it balances vocabulary size and coverage to efficiently represent language.
Common Pitfalls
#1 Treating WordPiece tokens as independent words in analysis.
Wrong approach:
tokens = ['un', '##happi', '##ness']
print('Number of words:', len(tokens))  # Outputs 3
Correct approach:
tokens = ['un', '##happi', '##ness']
# Tokens starting with '##' continue the previous word
words = sum(1 for t in tokens if not t.startswith('##'))
print('Number of words:', words)  # Outputs 1
Root cause:Misunderstanding that WordPiece tokens can be subword parts of a single word, not separate words.
#2 Using a WordPiece vocabulary built for one language on text in another language.
Wrong approach:
tokenizer = WordPieceTokenizer(vocab='english_vocab.txt')
tokens = tokenizer.tokenize('こんにちは')  # Japanese text
Correct approach:
tokenizer = WordPieceTokenizer(vocab='japanese_vocab.txt')
tokens = tokenizer.tokenize('こんにちは')
Root cause:Assuming one vocabulary fits all languages ignores language-specific subword patterns.
#3 Ignoring the '##' prefix and treating all tokens as standalone.
Wrong approach:
tokens = ['play', 'ing']  # Missing '##' prefix
print('Tokens:', tokens)
Correct approach:
tokens = ['play', '##ing']  # Correct WordPiece tokens
print('Tokens:', tokens)
Root cause:Not recognizing the '##' prefix indicates token continuation, which is important for reconstructing words.
Key Takeaways
WordPiece tokenization breaks words into smaller known pieces to help BERT understand any word, including rare or new ones.
It uses a fixed vocabulary built from frequent subword patterns, balancing vocabulary size and coverage.
The tokenizer greedily matches the longest subwords first, falling back to characters only if necessary.
This method makes unknown tokens rare, improving BERT’s robustness and accuracy on diverse text.
Understanding WordPiece’s design and tradeoffs is key to effectively using and tuning BERT models.