Challenge - 5 Problems
spaCy Tokenization Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 1:30 remaining
Output of spaCy Tokenization Code
What is the output of the following code snippet that uses spaCy to tokenize a sentence?
NLP
```python
import spacy

nlp = spacy.blank('en')
doc = nlp('Hello world! How are you?')
tokens = [token.text for token in doc]
print(tokens)
```
💡 Hint
Think about how spaCy splits punctuation from words by default.
Explanation: spaCy separates punctuation from adjacent words by default, so 'world!' is split into the two tokens 'world' and '!'. The full output is ['Hello', 'world', '!', 'How', 'are', 'you', '?'].
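To see why the punctuation peels off, here is a minimal pure-Python sketch of the behavior. This is an illustration only, not spaCy's real algorithm, which uses prefix/suffix/infix regexes and an exception table:

```python
import re

def simple_tokenize(text):
    # Rough approximation of spaCy's default behavior: keep runs of
    # word characters together and emit each punctuation mark as its
    # own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello world! How are you?"))
# ['Hello', 'world', '!', 'How', 'are', 'you', '?']
```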
🧠 Conceptual
Intermediate · 1:30 remaining
spaCy Tokenizer Behavior on Contractions
Which statement correctly describes how spaCy tokenizes contractions like "don't" by default?
💡 Hint
Think about how spaCy handles common English contractions.
Explanation: by default, spaCy splits contractions into the base word and the contracted part via tokenizer exception rules, so "don't" becomes the two tokens 'do' and 'n't'.
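Contractions are handled by exception (special-case) rules rather than the general splitting regexes. A hedged sketch of that lookup, with a tiny illustrative table mirroring a few of spaCy's English exceptions:

```python
# Illustrative exception table; spaCy ships a much larger one for English.
CONTRACTIONS = {
    "don't": ["do", "n't"],
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
}

def tokenize_word(word):
    # Check the exception table first; otherwise leave the word intact.
    return CONTRACTIONS.get(word, [word])

print(tokenize_word("don't"))  # ['do', "n't"]
```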
❓ Hyperparameter
Advanced · 2:00 remaining
Changing spaCy Tokenizer Behavior
Which spaCy component or method would you customize to change how tokens are split, for example to keep 'New York' as one token?
💡 Hint
Token splitting is controlled before tagging or parsing.
Explanation: the tokenizer component controls how text is split into tokens, and customizing it (e.g. with special-case rules) changes the splits before tagging or parsing runs. Note, however, that special cases cannot span whitespace, so a phrase like 'New York' is usually kept together by merging its tokens with doc.retokenize() after tokenization.
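Since special cases cannot span whitespace, keeping 'New York' as one token typically means merging adjacent tokens after tokenization. Here is a minimal pure-Python sketch of such a greedy merge (spaCy itself would do this with doc.retokenize()):

```python
def merge_phrases(tokens, phrases):
    # Greedy left-to-right scan: whenever a known multi-word phrase
    # starts at the current position, join it into a single token.
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tokens[i:i + n] == phrase:
                out.append(" ".join(phrase))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases(["I", "love", "New", "York", "!"], [["New", "York"]]))
# ['I', 'love', 'New York', '!']
```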
❓ Metrics
Advanced · 2:00 remaining
Evaluating Tokenization Accuracy
You have a gold standard tokenization and a spaCy tokenizer output. Which metric best measures how well spaCy tokenized the text compared to the gold standard?
💡 Hint
Think about comparing sets of tokens for overlap.
Explanation: the token-level F1 score combines the precision and recall of predicted tokens against the gold-standard tokens, making it well suited to tokenization evaluation.
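One simple way to compute token-level F1 is to treat the two tokenizations as multisets of tokens. This is a simplification for illustration; careful evaluations align tokens by character offsets instead, so that repeated words are matched at the right positions:

```python
from collections import Counter

def token_f1(gold, pred):
    # Multiset intersection: a predicted token counts as correct at
    # most as many times as it appears in the gold tokenization.
    overlap = sum((Counter(gold) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["San Francisco", "is", "nice", "."]
pred = ["San", "Francisco", "is", "nice", "."]
print(round(token_f1(gold, pred), 3))  # 0.667
```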
🔧 Debug
Expert · 2:30 remaining
Identifying Tokenization Bug in spaCy Customization
You added a special case to spaCy's tokenizer to keep 'San Francisco' as one token, but after running, it still splits into two tokens. What is the most likely cause?
NLP
```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')
special_case = [{ORTH: 'San Francisco'}]
nlp.tokenizer.add_special_case('San Francisco', special_case)
doc = nlp('I visited San Francisco last year.')
tokens = [token.text for token in doc]
print(tokens)
```
💡 Hint
Check the format of the special case argument.
Explanation: a special case must be a list of token dicts (one dict with ORTH per output token), and its key is matched against whitespace-delimited chunks, so a key containing a space such as 'San Francisco' can never match. Passing the full phrase as a single special case therefore has no effect; to keep the phrase together, merge the two tokens afterwards with doc.retokenize().
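The failure mode can be modeled in a few lines of plain Python: because whitespace splitting happens before special-case lookup, a key that contains a space is never consulted. This is a simplified sketch of the order of operations, not spaCy's actual implementation:

```python
def tokenize(text, special_cases):
    # Simplified model of spaCy's order of operations: split on
    # whitespace first, then match special cases per chunk, then
    # peel trailing punctuation.
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:  # never true for keys with spaces
            tokens.extend(special_cases[chunk])
            continue
        stripped = chunk.rstrip(".,!?")
        tokens.append(stripped)
        tokens.extend(chunk[len(stripped):])
    return tokens

rules = {"San Francisco": ["San Francisco"]}  # key has a space -> dead rule
print(tokenize("I visited San Francisco last year.", rules))
# ['I', 'visited', 'San', 'Francisco', 'last', 'year', '.']
```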