Which of the following best describes subword tokenization in natural language processing?
Think about how tokenization helps handle unknown or rare words by breaking them down.
Subword tokenization breaks words into smaller meaningful parts, allowing models to understand rare or new words by their components.
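To make this concrete, here is a minimal sketch of greedy longest-match subword splitting against a toy vocabulary. The vocabulary, the `<unk>` placeholder, and the example words are illustrative assumptions, not the behavior of any particular tokenizer library.

```python
# Minimal greedy longest-match subword splitter over a toy vocabulary.
# The vocabulary and words below are illustrative only.
def subword_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Find the longest vocabulary entry matching at this position.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No match at all: emit a placeholder token for this character.
            tokens.append("<unk>")
            start += 1
    return tokens

vocab = {"token", "ization", "un", "related"}
print(subword_tokenize("tokenization", vocab))  # ['token', 'ization']
print(subword_tokenize("unrelated", vocab))     # ['un', 'related']
```

Even though "unrelated" may never have appeared in training data, the model can still represent it from the familiar pieces "un" and "related".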
What is the output of the following Python code using a simple whitespace tokenizer?
text = "Machine learning is fun"
tokens = text.split()
print(tokens)
Remember what the split() method does by default.
The split() method without arguments splits the string on runs of whitespace, producing a list of words.
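Running the snippet confirms the answer:

```python
text = "Machine learning is fun"
tokens = text.split()  # splits on any run of whitespace by default
print(tokens)  # ['Machine', 'learning', 'is', 'fun']
```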
You want to train a language model on a large dataset with many rare words. Which vocabulary size is best to balance coverage and model size?
Think about how subword tokenization helps with rare words and model efficiency.
A moderate vocabulary with subword tokens balances coverage of rare words and keeps the model size manageable.
You have a tokenizer vocabulary of 10,000 tokens. After tokenizing a test set of 1,000 words, 50 words are split into multiple tokens. What is the approximate tokenization coverage percentage?
Coverage means how many words are represented as single tokens.
If 50 words are split, then 950 words are covered as single tokens. Coverage = (950/1000)*100 = 95%.
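The arithmetic above can be checked directly:

```python
total_words = 1000
split_words = 50  # words broken into multiple subword tokens
single_token_words = total_words - split_words

# Coverage = fraction of words represented as a single token, as a percentage.
coverage = single_token_words / total_words * 100
print(coverage)  # 95.0
```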
What error does the following code raise when trying to tokenize text using a vocabulary dictionary?
vocab = {"hello": 1, "world": 2}
text = "hello unknown world"
tokens = [vocab[word] for word in text.split()]
print(tokens)
Check what happens when a word is not found in the dictionary keys.
Accessing a dictionary with a key that does not exist raises a KeyError in Python.
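The snippet below demonstrates the KeyError and a common fix. Mapping unseen words to an `<unk>` id via dict.get is a standard pattern; the specific UNK_ID value of 0 here is an illustrative choice, not a fixed convention.

```python
vocab = {"hello": 1, "world": 2}
text = "hello unknown world"

try:
    tokens = [vocab[word] for word in text.split()]
except KeyError as e:
    # "unknown" is not a key in vocab, so indexing raises KeyError.
    print(f"KeyError: {e}")  # KeyError: 'unknown'

# A common fix: map out-of-vocabulary words to an <unk> id with dict.get.
UNK_ID = 0  # illustrative id for unknown words
tokens = [vocab.get(word, UNK_ID) for word in text.split()]
print(tokens)  # [1, 0, 2]
```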