NLP · ~20 mins

Tokenization in spaCy in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of spaCy Tokenization Code
What is the output of the following code snippet that uses spaCy to tokenize a sentence?
import spacy
nlp = spacy.blank('en')
doc = nlp('Hello world! How are you?')
tokens = [token.text for token in doc]
print(tokens)
A) ['Hello', 'world', '!', 'How', 'are', 'you', '?']
B) ['Hello world!', 'How are you?']
C) ['Hello', 'world!', 'How', 'are', 'you?']
D) ['Hello', 'world', 'How', 'are', 'you']
💡 Hint
Think about how spaCy splits punctuation from words by default.
🧠 Conceptual
intermediate
spaCy Tokenizer Behavior on Contractions
Which statement correctly describes how spaCy tokenizes contractions like "don't" by default?
A) It removes the apostrophe and returns 'dont' as one token.
B) It keeps "don't" as a single token.
C) It splits "don't" into three tokens: 'do', 'n', and 't'.
D) It splits "don't" into two tokens: 'do' and 'n't'.
💡 Hint
Think about how spaCy handles common English contractions.
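Once you've picked an answer, you can check it yourself. A minimal sketch, assuming spaCy is installed, that prints how the default English tokenizer handles a contraction:

```python
import spacy

# A blank English pipeline has no trained components, but it still
# loads the English tokenizer defaults, including the exception
# rules that govern contractions.
nlp = spacy.blank('en')

doc = nlp("I don't know.")
print([token.text for token in doc])
# prints ['I', 'do', "n't", 'know', '.']
```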
Hyperparameter
advanced
Changing spaCy Tokenizer Behavior
Which spaCy component or method would you customize to change how tokens are split, for example to keep 'New York' as one token?
A) Modify the tokenizer exceptions or add special cases to the tokenizer.
B) Change the pipeline's tagger component settings.
C) Adjust the parser's dependency rules.
D) Modify the lemmatizer's dictionary.
💡 Hint
Token splitting is controlled before tagging or parsing.
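For reference after answering: spaCy's tokenizer exposes `add_special_case`, which maps an exact string to a list of token dicts. A sketch assuming spaCy is installed ('gimme' is just an illustrative word, taken here as an arbitrary example):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')
print([t.text for t in nlp('gimme that')])  # ['gimme', 'that'] by default

# Register a special case: whenever the tokenizer sees the exact
# string 'gimme', it emits these two tokens instead of one.
nlp.tokenizer.add_special_case('gimme', [{ORTH: 'gim'}, {ORTH: 'me'}])
print([t.text for t in nlp('gimme that')])  # ['gim', 'me', 'that']
```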
Metrics
advanced
Evaluating Tokenization Accuracy
You have a gold standard tokenization and a spaCy tokenizer output. Which metric best measures how well spaCy tokenized the text compared to the gold standard?
A) Perplexity of the tokenizer output.
B) Token-level F1 score comparing spaCy tokens to gold tokens.
C) Sentence-level BLEU score.
D) Accuracy of part-of-speech tags.
💡 Hint
Think about comparing sets of tokens for overlap.
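To make the hint concrete, here is a self-contained sketch of a token-level overlap score, aligning tokens by their character offsets. This is one reasonable alignment choice, not the only one:

```python
def char_spans(tokens):
    """Map a token list to a set of (start, end) character offsets over
    the whitespace-free text, so different splits of the same text align."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def token_f1(gold_tokens, pred_tokens):
    """Token-level F1: a predicted token counts as correct only if its
    character span exactly matches a gold token's span."""
    gold, pred = char_spans(gold_tokens), char_spans(pred_tokens)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Only 'Hello' matches: precision 1/2, recall 1/3, F1 = 0.4.
print(token_f1(['Hello', 'world', '!'], ['Hello', 'world!']))  # 0.4
```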
🔧 Debug
expert
Identifying Tokenization Bug in spaCy Customization
You added a special case to spaCy's tokenizer to keep 'San Francisco' as one token, but after running, it still splits into two tokens. What is the most likely cause?
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')
special_case = [{ORTH: 'San Francisco'}]
nlp.tokenizer.add_special_case('San Francisco', special_case)
doc = nlp('I visited San Francisco last year.')
tokens = [token.text for token in doc]
print(tokens)
A) The ORTH symbol is incorrect; it should be LEMMA.
B) The tokenizer needs to be rebuilt after adding special cases.
C) The special case should be a list of dicts with separate tokens, not a single dict with the full phrase.
D) The blank model 'en' does not support special cases.
💡 Hint
Check the format of the special case argument.
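For reference once you've answered: tokenizer special cases match against whitespace-delimited substrings, so they cannot span a space. One common way to merge a multi-word span such as 'San Francisco' into a single token is `Doc.retokenize` (a sketch assuming spaCy is installed):

```python
import spacy

nlp = spacy.blank('en')
doc = nlp('I visited San Francisco last year.')

# Merge the two-token span doc[2:4] ('San' + 'Francisco')
# into a single token in place.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4])

print([token.text for token in doc])
# prints ['I', 'visited', 'San Francisco', 'last', 'year', '.']
```

In a real pipeline you would typically locate the span with a `Matcher` or entity annotations rather than hard-coding the indices as done here for brevity.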