Recall & Review
beginner
What is tokenization in spaCy?
Tokenization in spaCy is the process of breaking down text into smaller pieces called tokens, such as words, punctuation, or symbols, to help computers understand and analyze the text.
intermediate
How does spaCy handle tokenization differently from simple splitting by spaces?
spaCy uses language-specific rules (prefix, suffix, and infix patterns plus special cases) to split text into tokens, handling punctuation, contractions, and other exceptions, rather than just splitting by spaces, which keeps meaningful parts together.
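A quick way to see the difference is to compare a naive space split with spaCy's output. This sketch uses spacy.blank('en'), which provides the English tokenizer without requiring a downloaded model (spacy.load('en_core_web_sm') works the same way for tokenization):

```python
import spacy

# A blank English pipeline still includes the rule-based tokenizer.
nlp = spacy.blank("en")

text = "Don't split this, please!"

# Splitting by spaces keeps punctuation glued to words
# and cannot separate the contraction.
print(text.split())  # ["Don't", 'split', 'this,', 'please!']

# spaCy separates the contraction and the punctuation.
doc = nlp(text)
print([token.text for token in doc])
# ['Do', "n't", 'split', 'this', ',', 'please', '!']
```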
beginner
What Python code would you use to tokenize the sentence 'Hello, world!' using spaCy?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello, world!')
tokens = [token.text for token in doc]
print(tokens)  # Output: ['Hello', ',', 'world', '!']
beginner
Why is tokenization important before other NLP tasks?
Tokenization breaks text into manageable pieces, making it easier for models to analyze meaning, find patterns, and perform tasks like translation, sentiment analysis, or named entity recognition.
advanced
Can spaCy's tokenizer be customized? If yes, how?
Yes. You can add special-case rules with tokenizer.add_special_case, modify the prefix, suffix, and infix patterns, or replace the tokenizer entirely to better fit specific text types or languages.
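For instance, a special case can tell the tokenizer to split one string into several tokens. This minimal sketch uses spaCy's add_special_case API (spacy.blank('en') is used so no downloaded model is needed):

```python
import spacy
from spacy.symbols import ORTH

# A blank English pipeline includes the rule-based tokenizer.
nlp = spacy.blank("en")

# Teach the tokenizer to split "gimme" into two tokens.
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

doc = nlp("gimme that")
print([token.text for token in doc])  # ['gim', 'me', 'that']
```

Special cases only match the exact string, so the original text is always recoverable by joining the tokens.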
What does spaCy use to split text into tokens?
spaCy uses language-specific rules and special cases to accurately split text into tokens.
Which of these is NOT a token in the sentence 'Hello, world!' according to spaCy?
'Hello world' is two tokens, not one; spaCy splits them into separate tokens.
Why might you want to customize spaCy's tokenizer?
Customization helps spaCy handle special cases or unique text formats more accurately.
What Python function is used to load a spaCy language model for tokenization?
spacy.load() loads a language model needed for tokenization and other NLP tasks.
Tokenization helps in which of the following NLP tasks?
Tokenization prepares text for tasks like sentiment analysis by breaking it into meaningful parts.
Explain in your own words what tokenization in spaCy is and why it is useful.
Think about how breaking a sentence into words helps a computer understand it.
Describe how you would use spaCy to tokenize a sentence and get a list of tokens.
Remember the basic Python code to load a model and process text.