Practice

1. What does tokenization do in spaCy?

easy

A. It splits text into smaller pieces called tokens.

B. It trains a machine learning model.

C. It translates text into another language.

D. It visualizes text data.

Solution

Step 1: Understand tokenization concept
Tokenization means breaking text into smaller parts called tokens, like words or punctuation.
Step 2: Relate to spaCy functionality
In spaCy, tokenization is the first step of processing: calling the nlp object on a string splits it into tokens, which later analysis works on.
Final Answer:
It splits text into smaller pieces called tokens. -> Option A
Quick Check:
Tokenization = splitting text [OK]

Hint: Tokenization means breaking text into tokens [OK]

Common Mistakes:

Confusing tokenization with training models
Thinking tokenization translates text
Assuming tokenization visualizes data
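
To make the idea concrete, here is a minimal sketch comparing plain whitespace splitting with spaCy's tokenizer. It assumes spaCy is installed and uses a blank English pipeline (spacy.blank("en")) rather than a downloaded model; the sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because the
# tokenizer rules ship with spaCy itself and need no downloaded model.
nlp = spacy.blank("en")

text = "Tokenization splits text."

# Naive whitespace splitting leaves the period glued to the last word.
whitespace_pieces = text.split()

# Calling nlp on a string returns a Doc; iterating it yields Token objects.
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

print(whitespace_pieces)
print(spacy_tokens)
```

The second list has one more item than the first because the final period becomes its own token.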

2. Which of the following is the correct way to load the English model in spaCy for tokenization?

easy

A. import spacy; nlp = spacy.tokenize('en')

B. import spacy; nlp = spacy.load('en_core_web_sm')

C. import spacy; nlp = spacy.model('english')

D. import spacy; nlp = spacy.load_model('english')

Solution

Step 1: Recall spaCy model loading syntax
spaCy loads models using spacy.load with the model name as a string.
Step 2: Identify correct model name and function
The small English model is named 'en_core_web_sm' and is loaded with spacy.load('en_core_web_sm').
Final Answer:
import spacy; nlp = spacy.load('en_core_web_sm') -> Option B
Quick Check:
Use spacy.load('model_name') to load models [OK]

Hint: Use spacy.load('model_name') to load models [OK]

Common Mistakes:

Using spacy.tokenize instead of spacy.load
Wrong model names like 'english' instead of 'en_core_web_sm'
Using non-existent functions like load_model
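
The loading step can be sketched as below. spacy.load only works if the model package is already installed (for example with python -m spacy download en_core_web_sm), so this sketch falls back to a blank English pipeline when it is missing. The helper name load_english is hypothetical and not part of spaCy.

```python
import spacy


def load_english(model_name="en_core_web_sm"):
    """Load an installed English pipeline, or fall back to a blank one.

    load_english is a hypothetical helper for illustration, not a spaCy API.
    """
    try:
        return spacy.load(model_name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        # A blank pipeline still tokenizes, but it has no tagger, parser,
        # or named entity recognizer.
        return spacy.blank("en")


nlp = load_english()
print(nlp.lang)
print([token.text for token in nlp("Models load by name.")])
```

Either way the returned object is called the same way, so the tokenization code that follows does not need to change.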

3. What list of tokens does this code snippet print?

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello, world!')
tokens = [token.text for token in doc]
print(tokens)

medium

A. ['Hello', ',', 'world', '!']

B. ['Hello,', 'world!']

C. ['Hello world']

D. ['Hello', 'world!']

Solution

Step 1: Understand spaCy tokenization behavior
spaCy splits punctuation from words, so commas and exclamation marks become separate tokens.
Step 2: Analyze the given text 'Hello, world!'
The text is split into four separate tokens: 'Hello', ',', 'world', and '!'.
Final Answer:
['Hello', ',', 'world', '!'] -> Option A
Quick Check:
spaCy separates punctuation as tokens [OK]

Hint: Remember spaCy splits punctuation into separate tokens [OK]

Common Mistakes:

Keeping punctuation attached to words
Combining words into one token
Ignoring punctuation tokens
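
A quick way to check this behavior is to inspect each token's is_punct flag alongside its text. This is a minimal sketch that assumes a blank English pipeline in place of en_core_web_sm, since the tokenizer rules are the same and no model download is needed.

```python
import spacy

# Assumption: tokenizer-only pipeline, used instead of en_core_web_sm.
nlp = spacy.blank("en")

doc = nlp("Hello, world!")

# The text of each token, in order.
tokens = [token.text for token in doc]

# is_punct is True for tokens that consist of punctuation.
punct_flags = [token.is_punct for token in doc]

print(tokens)
print(punct_flags)
```

The comma and the exclamation mark show up as their own tokens, and they are the only ones flagged as punctuation.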

4. Identify the error in this spaCy tokenization code:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Test sentence.')
for token in doc:
print(token.text)

medium

A. The token.text attribute does not exist.

B. Wrong model name used in spacy.load.

C. Missing indentation for print inside the for loop.

D. The variable 'doc' is not defined.

Solution

Step 1: Check Python syntax for loops
Python requires the code inside a for loop to be indented properly.
Step 2: Inspect the given code
The print call is not indented under the for loop, so Python raises an IndentationError before any of the code runs.
Final Answer:
Missing indentation for print inside the for loop. -> Option C
Quick Check:
Indent loop body code in Python [OK]

Hint: Indent code inside loops to avoid errors [OK]

Common Mistakes:

Ignoring Python indentation rules
Assuming model name is wrong
Thinking token.text is invalid
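
The difference can be checked without running spaCy at all, because the error is raised when Python compiles the code. The sketch below compiles the loop as given and again with the body indented; the helper name compiles is hypothetical.

```python
# The loop from the question, first as given and then with the body indented.
broken_source = "for token in doc:\nprint(token.text)\n"
fixed_source = "for token in doc:\n    print(token.text)\n"


def compiles(source):
    """Return True if Python can compile the source, False on IndentationError."""
    try:
        # compile() only parses the code; it never executes the loop,
        # so 'doc' does not need to exist here.
        compile(source, "<snippet>", "exec")
        return True
    except IndentationError:
        return False


print(compiles(broken_source))
print(compiles(fixed_source))
```

Only the version with the indented print call compiles, which is why the fix is to indent the loop body.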

5. You want to tokenize a sentence but keep contractions like "don't" as one token using spaCy. Which approach is best?

hard

A. Use the default spaCy tokenizer without changes.

B. Split contractions manually after tokenization.

C. Replace contractions with full words before tokenization.

D. Modify the tokenizer exceptions to keep contractions as single tokens.

Solution

Step 1: Understand spaCy's default tokenizer behavior
By default, spaCy splits contractions like "don't" into two tokens: 'do' and "n't".
Step 2: Identify how to keep contractions as one token
Tokenizer exceptions are special-case rules that override the default splitting, so adding one for a contraction makes spaCy keep it as a single token.
Final Answer:
Modify the tokenizer exceptions to keep contractions as single tokens. -> Option D
Quick Check:
Customize tokenizer exceptions to control token splits [OK]

Hint: Change tokenizer exceptions to keep contractions whole [OK]

Common Mistakes:

Using default tokenizer expecting contractions as one token
Splitting contractions manually after tokenization
Replacing contractions before tokenization unnecessarily
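
One way to modify the tokenizer exceptions is the tokenizer's add_special_case method. The sketch below is a minimal illustration under stated assumptions: it uses a blank English pipeline, a made-up sample sentence, and registers a rule for the exact string "don't" only.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline; the same call works on a loaded model.
nlp = spacy.blank("en")
text = "I don't know."

# Default behavior: the contraction is split into two tokens.
before = [token.text for token in nlp(text)]

# Register a special case mapping the string "don't" to a single token.
# The ORTH values must join back to exactly the matched string.
nlp.tokenizer.add_special_case("don't", [{ORTH: "don't"}])

# Same text, tokenized again with the new rule in place.
after = [token.text for token in nlp(text)]

print(before)
print(after)
```

Special cases match exact strings, so other forms such as "Don't" or "can't" would each need their own rule.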
