Recall & Review
beginner
What is tokenization in spaCy?
Tokenization in spaCy is the process of breaking down text into smaller pieces called tokens, such as words, punctuation, or symbols, to help computers understand and analyze the text.
intermediate
How does spaCy handle tokenization differently from simple splitting by spaces?
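spaCy first splits on whitespace and then applies language-specific rules: it peels punctuation off the start and end of each chunk, splits on infixes such as hyphens, and consults a table of exceptions for contractions and abbreviations. The tokenizer is rule-based rather than a trained model, and it never discards characters, so the original text can be rebuilt from the tokens.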
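The difference shows up when `str.split()` and spaCy are run on the same string. This is a minimal sketch, assuming spaCy is installed; it uses `spacy.blank("en")`, which provides the English tokenizer rules without downloading a trained pipeline. The example sentence is an illustration chosen here, not part of the original card.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no trained components).
nlp = spacy.blank("en")

text = "Let's go to N.Y.!"

# Naive approach: split on whitespace only.
whitespace_tokens = text.split()

# spaCy approach: whitespace split followed by language-specific rules and exceptions.
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

# Tokenization is non-destructive: joining each token with its trailing
# whitespace reproduces the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)

print(whitespace_tokens)
print(spacy_tokens)
print(rebuilt == text)
```

The whitespace split leaves the exclamation mark attached to the last word, while spaCy separates it and splits the contraction "Let's" into two tokens.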
beginner
What Python code would you use to tokenize the sentence 'Hello, world!' using spaCy?
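A minimal sketch of one possible answer, assuming spaCy is installed. It tries the `en_core_web_sm` pipeline used in the practice questions on this page and falls back to a blank English pipeline when that package is not downloaded; the fallback is an assumption added here so the snippet still runs, and both options use the English tokenizer rules.

```python
import spacy

try:
    # Trained small English pipeline (must be downloaded separately).
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Fallback: tokenizer-only English pipeline, no download needed.
    nlp = spacy.blank("en")

# Calling the pipeline on a string returns a Doc, which is a sequence of tokens.
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]
print(tokens)  # ['Hello', ',', 'world', '!']
```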
Why is tokenization important before other NLP tasks?
Tokenization breaks text into manageable pieces, making it easier for models to analyze meaning, find patterns, and perform tasks like translation, sentiment analysis, or named entity recognition.
advanced
Can spaCy's tokenizer be customized? If yes, how?
Yes. You can add special cases for individual strings, change the prefix, suffix, and infix rules, or replace the tokenizer entirely so that it better fits a specific text type or language.
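The simplest of these customizations, a special case, can be sketched as follows. The snippet assumes spaCy is installed and uses a blank English pipeline; the string "gimme" and its split are illustrative choices, not something this page prescribes.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Default behavior: "gimme" is kept as a single token.
before = [token.text for token in nlp("gimme that")]

# Register a special case. The ORTH values must join back to the
# original string exactly ("gim" + "me" == "gimme").
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]

print(before)  # ['gimme', 'that']
print(after)   # ['gim', 'me', 'that']
```

A special case applies only to the exact string it was registered for, so variants such as "Gimme" need their own entries.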
What does spaCy use to split text into tokens?
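A. Only spaces
B. Manual input
C. Random splitting
D. Language-specific rules and exceptions
Answer: Option D
spaCy's tokenizer applies language-specific prefix, suffix, and infix rules together with a table of exceptions. It is rule-based; the trained components of a pipeline run after tokenization.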
Which of these is NOT a token in the sentence 'Hello, world!' according to spaCy?
A. Hello
B. Hello world
C. ,
D. world
Answer: Option B
'Hello world' spans two tokens, not one; spaCy emits 'Hello' and 'world' separately, with the comma as its own token between them.
Why might you want to customize spaCy's tokenizer?
A. To handle special text cases better
B. To make it slower
C. To remove punctuation always
D. To ignore all spaces
Answer: Option A
Customization helps spaCy handle special cases or unusual text formats more accurately.
What Python function is used to load a spaCy language model for tokenization?
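A. spacy.load()
B. spacy.tokenize()
C. spacy.split()
D. spacy.model()
Answer: Option A
spacy.load() loads a pipeline package such as 'en_core_web_sm'. Calling the returned object on a string tokenizes the text and then runs the pipeline's other components.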
Tokenization helps in which of the following NLP tasks?
A. Audio processing
B. Image recognition
C. Sentiment analysis
D. Video editing
Answer: Option C
Tokenization prepares text for tasks like sentiment analysis by breaking it into meaningful parts.
Explain in your own words what tokenization in spaCy is and why it is useful.
Think about how breaking a sentence into words helps a computer understand it.
Describe how you would use spaCy to tokenize a sentence and get a list of tokens.
Remember the basic Python code to load a model and process text.
Practice
1. What does tokenization do in spaCy?
easy
A. It splits text into smaller pieces called tokens.
B. It trains a machine learning model.
C. It translates text into another language.
D. It visualizes text data.
Solution
Step 1: Understand tokenization concept
Tokenization means breaking text into smaller parts called tokens, like words or punctuation.
Step 2: Relate to spaCy functionality
spaCy uses tokenization to prepare text for analysis by splitting it into tokens.
Final Answer:
It splits text into smaller pieces called tokens. -> Option A
Quick Check:
Tokenization = splitting text
Hint: Tokenization means breaking text into tokens
Common Mistakes:
Confusing tokenization with training models
Thinking tokenization translates text
Assuming tokenization visualizes data
2. Which of the following is the correct way to load the English model in spaCy for tokenization?
easy
A. import spacy; nlp = spacy.tokenize('en')
B. import spacy; nlp = spacy.load('en_core_web_sm')
C. import spacy; nlp = spacy.model('english')
D. import spacy; nlp = spacy.load_model('english')
Solution
Step 1: Recall spaCy model loading syntax
spaCy loads models using spacy.load with the model name as a string.
Step 2: Identify correct model name and function
The small English pipeline is named 'en_core_web_sm' and is loaded with spacy.load('en_core_web_sm').
Final Answer:
import spacy; nlp = spacy.load('en_core_web_sm') -> Option B
Quick Check:
Use spacy.load('model_name') to load models
Hint: Use spacy.load('model_name') to load models
Common Mistakes:
Using spacy.tokenize instead of spacy.load
Wrong model names like 'english' instead of 'en_core_web_sm'
Using non-existent functions like load_model
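The wrong-name mistake can be observed directly. The sketch below assumes spaCy is installed; `try_load` is a hypothetical helper written for this illustration, not a spaCy function. It relies on `spacy.load` raising `OSError` when no installed package or path matches the given name.

```python
import spacy


def try_load(name):
    """Return (pipeline, None) on success, or (None, error message) on failure."""
    try:
        return spacy.load(name), None
    except OSError as err:
        # Raised when the name is not an installed pipeline package or a valid path.
        return None, str(err)


# 'english' is not a spaCy pipeline package name, so loading is expected to fail.
nlp, error = try_load("english")
print(nlp is None)
print(error)
```

Passing 'en_core_web_sm' to the same helper returns a pipeline, provided that package has been downloaded.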
3. What will be the output tokens list from this code snippet?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello, world!')
tokens = [token.text for token in doc]
print(tokens)
medium
A. ['Hello', ',', 'world', '!']
B. ['Hello,', 'world!']
C. ['Hello world']
D. ['Hello', 'world!']
Solution
Step 1: Understand spaCy tokenization behavior
spaCy splits punctuation from words, so commas and exclamation marks become separate tokens.
Step 2: Analyze the given text 'Hello, world!'
Tokens will be 'Hello', ',', 'world', and '!' separately.
Final Answer:
['Hello', ',', 'world', '!'] -> Option A
Quick Check:
spaCy separates punctuation as tokens
Hint: Remember spaCy splits punctuation into separate tokens
Common Mistakes:
Keeping punctuation attached to words
Combining words into one token
Ignoring punctuation tokens
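Because punctuation marks are separate tokens, they can be inspected or filtered one by one. This is a minimal sketch, assuming spaCy is installed; it uses a blank English pipeline rather than 'en_core_web_sm' so that no download is needed, and it relies on the `is_punct` token attribute.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world!")

# Every punctuation mark is its own token, flagged by is_punct.
all_tokens = [token.text for token in doc]
words_only = [token.text for token in doc if not token.is_punct]

print(all_tokens)  # ['Hello', ',', 'world', '!']
print(words_only)  # ['Hello', 'world']
```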
4. Identify the error in this spaCy tokenization code:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Test sentence.')
for token in doc:
print(token.text)
medium
A. The token.text attribute does not exist.
B. Wrong model name used in spacy.load.
C. Missing indentation for print inside the for loop.
D. The variable 'doc' is not defined.
Solution
Step 1: Check Python syntax for loops
Python requires the code inside a for loop to be indented properly.
Step 2: Inspect the given code
The print statement is not indented under the for loop, causing an IndentationError.
Final Answer:
Missing indentation for print inside the for loop. -> Option C
Quick Check:
Indent loop body code in Python
Hint: Indent code inside loops to avoid errors
Common Mistakes:
Ignoring Python indentation rules
Assuming model name is wrong
Thinking token.text is invalid
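For comparison, a corrected version of the loop might look like the sketch below. It assumes spaCy is installed and swaps in a blank English pipeline so that it runs without a model download; the `printed` list is an addition made here to keep the results around for inspection.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Test sentence.")

printed = []
for token in doc:
    # The loop body is indented, as Python requires.
    print(token.text)
    printed.append(token.text)
```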
5. You want to tokenize a sentence but keep contractions like "don't" as one token using spaCy. Which approach is best?
hard
A. Use the default spaCy tokenizer without changes.
B. Split contractions manually after tokenization.
C. Replace contractions with full words before tokenization.
D. Modify the tokenizer exceptions to keep contractions as single tokens.
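A minimal sketch of what option D could look like, assuming spaCy is installed and using a blank English pipeline. It registers a special case through `add_special_case`, which replaces the default exception for that exact string; the example sentence is an illustration chosen here.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
text = "I don't know."

# Default English rules split the contraction into "do" and "n't".
default_tokens = [token.text for token in nlp(text)]

# Override the exception so that "don't" stays whole.
# The ORTH value must match the original string exactly.
nlp.tokenizer.add_special_case("don't", [{ORTH: "don't"}])
custom_tokens = [token.text for token in nlp(text)]

print(default_tokens)
print(custom_tokens)
```

The override covers only the exact string "don't". Other forms such as "Don't" or "can't" keep the default split unless they are registered as well.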