
BERT tokenization (WordPiece) in NLP - ML Experiment: Train & Evaluate

Experiment - BERT tokenization (WordPiece)
Problem: You want to tokenize sentences using BERT's WordPiece tokenizer to prepare text data for a BERT model.
Current Metrics: Tokenization is done, but the tokens do not match the expected WordPiece tokens, causing poor model input quality.
Issue: The current tokenization uses a simple whitespace split instead of WordPiece, producing incorrect subword tokens and degrading the model's understanding of the input.
Your Task
Replace the simple whitespace tokenizer with BERT's WordPiece tokenizer and verify that tokenization matches expected WordPiece tokens.
Use the Hugging Face transformers library's BertTokenizer.
Do not change the input sentences.
Ensure the output tokens include subword tokens starting with '##' where appropriate.
Solution
from transformers import BertTokenizer

# Initialize the BERT WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Sample sentences
sentences = [
    "Playing football is fun.",
    "Unbelievable! This is amazing."
]

# Tokenize using simple whitespace split (incorrect)
simple_tokens = [sentence.split() for sentence in sentences]

# Tokenize using BERT WordPiece tokenizer (correct)
wordpiece_tokens = [tokenizer.tokenize(sentence) for sentence in sentences]

print("Simple tokens:", simple_tokens)
print("WordPiece tokens:", wordpiece_tokens)
Replaced simple whitespace split tokenization with BertTokenizer from Hugging Face.
Used 'bert-base-uncased' pretrained tokenizer to get WordPiece tokens.
Used tokenizer.tokenize() method to get subword tokens with '##' prefix where needed.
Results Interpretation

Before: Tokenization splits sentences by spaces, e.g., 'Playing football' -> ['Playing', 'football'].

After: WordPiece tokenization splits words that are absent from the vocabulary into subwords, e.g., 'playing' can become ['play', '##ing'] when the whole word is not in the vocabulary, capturing word structure better. (Note that common words like 'playing' may already be single entries in the bert-base-uncased vocabulary, in which case they are not split.)

Using BERT's WordPiece tokenizer breaks words into meaningful subword units, which helps the model understand rare or complex words better and improves overall model input quality.
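For intuition, here is a minimal sketch of the greedy longest-match-first matching that WordPiece uses, with a tiny made-up vocabulary. The `wordpiece_tokenize` function and the toy vocab are illustrative assumptions, not the transformers library's implementation (a real BERT vocabulary has roughly 30,000 entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece for a single word.

    Continuation pieces carry a '##' prefix. If no piece in the vocab
    can cover part of the word, the whole word maps to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Find the longest substring starting at `start` that is in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for demonstration only
vocab = {"play", "##ing", "fun", "football", "un", "##able"}

print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyzzy", vocab))    # ['[UNK]']
```

Because matching is greedy from the left, the longest known prefix ('play') is consumed first, and the remainder is matched as '##'-prefixed continuation pieces.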
Bonus Experiment
Try tokenizing sentences with out-of-vocabulary words or misspellings and observe how WordPiece handles them.
💡 Hint
Use sentences with made-up words or typos and see how WordPiece breaks them into known subwords or characters.
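As a self-contained illustration of the bonus idea (the `wp` helper and toy vocabulary below are made up for demonstration and are not part of transformers): when the vocabulary also contains single-character pieces, a misspelled or made-up word decomposes into character-level subwords instead of collapsing to [UNK].

```python
def wp(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece for a single word (toy version)."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            cand = ("##" if start else "") + word[start:end]
            if cand in vocab:
                pieces.append(cand)
                start = end
                break
        else:
            return [unk]  # nothing in the vocab covers this position
    return pieces

# Toy vocab: a few whole words plus single-character fallback pieces
letters = set("abcdefghijklmnopqrstuvwxyz")
vocab = {"amazing", "this", "is"} | letters | {"##" + c for c in letters}

print(wp("amazing", vocab))  # whole word is in the vocab -> ['amazing']
print(wp("amzing", vocab))   # typo: falls back to character-level pieces
```

This mirrors what you should observe with the real tokenizer: rare words and typos get broken into progressively smaller known pieces, so the model still receives meaningful input rather than a single unknown token.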