BERT tokenization breaks text into smaller pieces called tokens. This helps the model understand words and parts of words better.
BERT tokenization (WordPiece) in NLP
Introduction
When preparing text data for BERT-based models.
When you want to handle unknown or rare words by splitting them into known parts.
When you need consistent tokenization that matches BERT's training.
When working with tasks like text classification, question answering, or named entity recognition using BERT.
Syntax
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
tokenize(text) splits the input text into WordPiece tokens.
convert_tokens_to_ids(tokens) maps each token to its integer ID in BERT's vocabulary.
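Under the hood, tokenize uses a greedy longest-match-first strategy against a fixed vocabulary. Here is a simplified sketch of that idea using a tiny made-up vocabulary (the real 'bert-base-uncased' vocabulary has about 30,000 entries, and the real tokenizer also lowercases and splits on punctuation first):

```python
# Simplified WordPiece sketch: greedy longest-match-first over a toy vocabulary.
TOY_VOCAB = {"play", "##ing", "##ed", "hello", "world"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        # Find the longest substring (with a '##' prefix after the first
        # piece) that is present in the vocabulary.
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("playing"))  # ['play', '##ing']
print(wordpiece("played"))   # ['play', '##ed']
```

With this toy vocabulary, a word containing no matchable piece collapses to '[UNK]', which mirrors how BERT's WordPiece tokenizer handles words it cannot cover.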
Examples
If a whole word is not in the vocabulary, WordPiece splits it into a stem plus '##'-prefixed suffix pieces, e.g. ['play', '##ing']. (Whether 'playing' is actually split depends on the model's vocabulary; common words are often kept whole.)
text = "playing"
tokens = tokenizer.tokenize(text)
print(tokens)
Rare or unknown words get split into known pieces, for example ['un', '##aff', '##able'] for 'unaffable' (the exact split depends on the model's vocabulary).
text = "unaffable"
tokens = tokenizer.tokenize(text)
print(tokens)
Simple words stay whole: ['hello', 'world'].
text = "hello world"
tokens = tokenizer.tokenize(text)
print(tokens)
Sample Model
This code shows how to split text into WordPiece tokens, convert them to IDs, and decode the IDs back to text with the BERT tokenizer.
from transformers import BertTokenizer

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Sample text
text = "Playing with BERT tokenization is fun!"

# Tokenize text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded text:", decoded_text)
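The tokens-to-IDs step above is just a dictionary lookup into the vocabulary, and converting IDs back to tokens is the reverse lookup. A toy sketch of that round trip, with made-up IDs (real IDs come from the pretrained model's vocab file):

```python
# Toy vocabulary mapping: token string -> integer ID. The IDs here are
# invented for illustration; the real mapping ships with the model.
TOY_IDS = {"[UNK]": 0, "play": 1, "##ing": 2, "hello": 3, "world": 4}
TOY_TOKENS = {i: t for t, i in TOY_IDS.items()}

def convert_tokens_to_ids(tokens):
    # Unknown tokens fall back to the [UNK] ID, as BERT's tokenizer does.
    return [TOY_IDS.get(t, TOY_IDS["[UNK]"]) for t in tokens]

def convert_ids_to_tokens(ids):
    return [TOY_TOKENS[i] for i in ids]

ids = convert_tokens_to_ids(["play", "##ing"])
print(ids)                         # [1, 2]
print(convert_ids_to_tokens(ids))  # ['play', '##ing']
```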
Important Notes
WordPiece tokens starting with '##' continue the previous token; they are pieces of a word, not standalone words.
BERT tokenizer lowercases text by default for 'bert-base-uncased'.
Token IDs are what BERT uses internally to understand text.
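The '##' convention in the notes above also makes detokenization straightforward: a piece that starts with '##' is glued onto the previous piece, while anything else starts a new word. A minimal sketch:

```python
def merge_wordpieces(tokens):
    # Glue '##'-prefixed pieces onto the previous token; others start new words.
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(merge_wordpieces(["play", "##ing", "with", "bert"]))  # playing with bert
```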
Summary
BERT tokenization splits words into smaller pieces called WordPieces.
This helps handle unknown words by breaking them into known parts.
Use BERT tokenizer to prepare text for BERT models correctly.