When BERT's WordPiece tokenizer encounters a word not in its vocabulary, what does it do?
Think about how WordPiece tries to represent words using pieces it knows.
WordPiece splits unknown words into smaller subwords that exist in its vocabulary. If no subwords match, it uses the [UNK] token.
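The splitting behavior can be sketched as a greedy longest-match-first loop. This is a simplified model of the algorithm, not the actual HuggingFace implementation, and the tiny vocabulary here is purely illustrative:

```python
def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    # Greedy longest-match-first: repeatedly take the longest
    # vocabulary piece that matches the start of the remainder.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

vocab = {'un', '##aff', '##able'}
print(wordpiece_tokenize('unaffable', vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize('zzz', vocab))        # ['[UNK]']
```

Note that a single unmatched span makes the whole word fall back to [UNK]; real vocabularies include every single character, so this is rare in practice.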
Given the input word "unaffable", what is the output token list from BERT's WordPiece tokenizer?
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize('unaffable')
print(tokens)
Look for subwords starting with ## indicating continuation pieces.
WordPiece splits "unaffable" into "un", "##aff", and "##able" tokens, where "##" marks subword pieces continuing a word.
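Because the "##" marker unambiguously flags continuation pieces, reassembling the original words is straightforward. A minimal sketch (not part of the transformers API, which offers `convert_tokens_to_string` for this):

```python
def merge_wordpieces(tokens):
    # Rebuild surface words: strip the ## marker and glue
    # continuation pieces onto the previous piece; any token
    # without ## starts a new word.
    words = []
    for tok in tokens:
        if tok.startswith('##') and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

print(merge_wordpieces(['un', '##aff', '##able']))  # ['unaffable']
print(merge_wordpieces(['play', '##ing', 'ball']))  # ['playing', 'ball']
```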
You want to fine-tune BERT on medical text with many rare terms. Which tokenizer approach is best?
Think about how to handle many rare or new words effectively.
Training a new WordPiece tokenizer on domain-specific text helps capture rare terms as meaningful subwords, improving tokenization quality.
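One way to train such a tokenizer is with the HuggingFace `tokenizers` library. A rough sketch, where the corpus file, vocabulary size, and training sentence are placeholders you would replace with your own medical text and settings:

```python
from tokenizers import BertWordPieceTokenizer

# Stand-in corpus file; in practice, point at your medical text files.
with open('medical_corpus.txt', 'w') as f:
    f.write('myocardial infarction treated with anticoagulant therapy\n' * 50)

# Train a fresh WordPiece vocabulary on the domain text.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=['medical_corpus.txt'], vocab_size=200, min_frequency=1)

# Frequent domain terms now map to fewer, more meaningful pieces.
print(tokenizer.encode('myocardial infarction').tokens)
```

A common alternative is to keep BERT's original tokenizer and simply add the most frequent domain terms via `tokenizer.add_tokens(...)`, which avoids retraining the model's embeddings from scratch.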
You compare two tokenizers on the same text. Tokenizer A produces 100 tokens; Tokenizer B produces 130 tokens. Which statement is true about tokenization efficiency?
Fewer tokens usually mean less computation and simpler input.
Fewer tokens mean a shorter input sequence, which usually means faster processing and lower memory use, so Tokenizer A is more efficient on this text.
Consider this code snippet:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(12345)
print(tokens)

What error does this code raise?
Check the input type expected by the tokenizer.
The tokenizer expects a string input. Passing an integer causes a TypeError.
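The failure can be reproduced without downloading a model: BERT-style pre-tokenization splits text with a regex, and `re.split` rejects non-string input. A minimal sketch that mirrors the behavior (not the actual transformers implementation):

```python
import re

def naive_pretokenize(text):
    # Whitespace splitting stands in for BERT's basic tokenization;
    # re.split raises TypeError for non-string input, just like the
    # real tokenizer does.
    return re.split(r'\s+', text)

print(naive_pretokenize('hello world'))  # ['hello', 'world']

try:
    naive_pretokenize(12345)
except TypeError as e:
    print('TypeError:', e)

print(naive_pretokenize(str(12345)))     # ['12345'] -- the fix: cast first
```

The same fix applies to the original snippet: call `tokenizer.tokenize(str(12345))`, or better, validate inputs upstream so numbers never reach the tokenizer as raw ints.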