
BERT tokenization (WordPiece) in NLP

Introduction

BERT tokenization breaks text into smaller units called tokens using the WordPiece algorithm. This lets the model represent both whole words and subword pieces, so even rare words map onto entries it saw during training.

Use BERT's WordPiece tokenizer when:
Preparing text data for BERT-based models.
Handling unknown or rare words by splitting them into known subword pieces.
You need tokenization consistent with BERT's pretraining.
Working on tasks like text classification, question answering, or named entity recognition with BERT.
Syntax
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

tokenize(text) splits the input text into WordPiece tokens.

convert_tokens_to_ids(tokens) maps each token to its integer ID in the tokenizer's vocabulary, which is the form BERT actually consumes.
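Under the hood, WordPiece uses greedy longest-match-first: starting from the left of a word, it repeatedly takes the longest substring found in the vocabulary, prefixing mid-word pieces with '##'. Here is a minimal sketch of that algorithm; the tiny `toy_vocab` is an invented stand-in for BERT's real ~30,000-entry vocabulary, not the actual list.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mid-word pieces carry the '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary for illustration only.
toy_vocab = {"play", "##ing", "un", "##aff", "##able", "hello"}

print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("hello", toy_vocab))      # ['hello']
```

The real tokenizer adds text cleanup, punctuation splitting, and a maximum word length on top of this core loop, but the greedy matching is the same idea.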

Examples
WordPiece can split a word into a root and suffix, e.g. 'playing' into ['play', '##ing']. The exact split depends on the vocabulary; words common enough to be in the vocab stay whole.
text = "playing"
tokens = tokenizer.tokenize(text)
print(tokens)
Rare or unknown words get split into known subword pieces, e.g. ['un', '##aff', '##able'] (the exact pieces depend on the vocabulary).
text = "unaffable"
tokens = tokenizer.tokenize(text)
print(tokens)
Simple words stay whole: ['hello', 'world'].
text = "hello world"
tokens = tokenizer.tokenize(text)
print(tokens)
Sample Model

This code shows how to split text into WordPiece tokens, convert them to IDs, and decode the IDs back to text using the BERT tokenizer.

from transformers import BertTokenizer

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Sample text
text = "Playing with BERT tokenization is fun!"

# Tokenize text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded text:", decoded_text)
Important Notes

WordPiece tokens starting with '##' are continuations of the preceding token, not standalone words.

The 'bert-base-uncased' tokenizer lowercases text by default; cased variants such as 'bert-base-cased' preserve case.

Token IDs are indices into the tokenizer's vocabulary; they are what BERT uses internally to represent text.
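The '##' convention makes it easy to reassemble tokens into words: a '##' piece glues onto the token before it. The small helper below (`merge_wordpieces` is a name invented here, not a transformers API) sketches that merge.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into words: '##' marks a continuation."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # append continuation to the previous word
        else:
            words.append(tok)
    return words

print(merge_wordpieces(["play", "##ing", "with", "bert"]))
# ['playing', 'with', 'bert']
```

In practice you would use tokenizer.decode or convert_tokens_to_string for this, which also handle special tokens like [CLS] and [SEP].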

Summary

BERT tokenization splits words into smaller pieces called WordPieces.

This helps handle unknown words by breaking them into known parts.

Use the BERT tokenizer to prepare text for BERT models correctly.