
Tokenization in spaCy in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems

❓ Predict Output

intermediate

Output of spaCy Tokenization Code

What is the output of the following code snippet that uses spaCy to tokenize a sentence?

import spacy
nlp = spacy.blank('en')
doc = nlp('Hello world! How are you?')
tokens = [token.text for token in doc]
print(tokens)

A. ['Hello', 'world', '!', 'How', 'are', 'you', '?']

B. ['Hello world!', 'How are you?']

C. ['Hello', 'world!', 'How', 'are', 'you?']

D. ['Hello', 'world', 'How', 'are', 'you']

🧠 Conceptual

intermediate

spaCy Tokenizer Behavior on Contractions

Which statement correctly describes how spaCy tokenizes contractions like "don't" by default?

A. It removes the apostrophe and returns "dont" as one token.

B. It keeps "don't" as a single token.

C. It splits "don't" into three tokens: "do", "n", and "t".

D. It splits "don't" into two tokens: "do" and "n't".
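
To check your answer empirically, a minimal sketch like the following prints the tokens for a sentence containing a contraction. It assumes spaCy is installed and uses the same blank English pipeline as the first challenge; the variable name `contraction_tokens` is only illustrative.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")

# Tokenize a sentence with a contraction and inspect the pieces.
contraction_tokens = [token.text for token in nlp("I don't know")]
print(contraction_tokens)
```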

❓ Hyperparameter

advanced

Changing spaCy Tokenizer Behavior

Which spaCy component or method would you customize to change how tokens are split, for example to keep 'New York' as one token?

A. Modify the tokenizer exceptions or add special cases to the tokenizer.

B. Change the pipeline's tagger component settings.

C. Adjust the parser's dependency rules.

D. Modify the lemmatizer's dictionary.
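
For orientation, here is a minimal sketch of how a tokenizer special case is registered with `Tokenizer.add_special_case`. It assumes spaCy is installed; the string "gimme" and its split are illustrative and not part of the question. A special case maps one exact string to a fixed token sequence, and the `ORTH` values must concatenate back to that string.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Tokens before any customization, for comparison.
print([t.text for t in nlp("gimme that")])

# Register a special case: "gimme" is always tokenized as "gim" + "me".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

tokens = [t.text for t in nlp("gimme that")]
print(tokens)
```

The rule applies only to the exact string given, so differently cased or inflected forms would need their own entries.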

❓ Metrics

advanced

Evaluating Tokenization Accuracy

You have a gold standard tokenization and a spaCy tokenizer output. Which metric best measures how well spaCy tokenized the text compared to the gold standard?

A. Perplexity of the tokenizer output.

B. Token-level F1 score comparing spaCy tokens to gold tokens.

C. Sentence-level BLEU score.

D. Accuracy of part-of-speech tags.
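
If you want to compare two tokenizations yourself, one possible way to compute a token-level F1 is sketched below in plain Python. It is an illustration under stated assumptions, not spaCy's own scorer: the helper names `token_spans` and `token_f1` are hypothetical, a token counts as correct only when its character span matches a gold span exactly, and every token is assumed to occur in the text in order.

```python
def token_spans(tokens, text):
    """Map each token to its (start, end) character span in text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # find the token at or after pos
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def token_f1(gold_tokens, pred_tokens, text):
    """F1 over exact character-span matches between gold and predicted tokens."""
    gold = set(token_spans(gold_tokens, text))
    pred = set(token_spans(pred_tokens, text))
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "I don't know."
gold = ["I", "do", "n't", "know", "."]   # gold standard splits the contraction
pred = ["I", "don't", "know", "."]       # hypothetical tokenizer output
score = token_f1(gold, pred, text)
print(round(score, 3))
```

Here three of the four predicted spans match gold spans (precision 3/4) and three of the five gold spans are recovered (recall 3/5).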

🔧 Debug

expert

Identifying Tokenization Bug in spaCy Customization

You added a special case to spaCy's tokenizer to keep 'San Francisco' as one token, but after running, it still splits into two tokens. What is the most likely cause?

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')
special_case = [{ORTH: 'San Francisco'}]
nlp.tokenizer.add_special_case('San Francisco', special_case)
doc = nlp('I visited San Francisco last year.')
tokens = [token.text for token in doc]
print(tokens)

A. The ORTH symbol is incorrect; it should be LEMMA.

B. The tokenizer needs to be rebuilt after adding special cases.

C. The special case should be a list of dicts with separate tokens, not a single dict with the full phrase.

D. The blank model 'en' does not support special cases.
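
Separately from the special-case mechanism in the question, a multi-word span can also be merged after tokenization with `Doc.retokenize`. The sketch below is a different technique, not a fix for the snippet above. It assumes spaCy is installed and hard-codes the token indices of "San" and "Francisco" for this one sentence.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I visited San Francisco last year.")

# Merge tokens 2 and 3 ("San", "Francisco") into a single token in place.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4])

merged_tokens = [token.text for token in doc]
print(merged_tokens)
```

In practice the span would usually come from a matcher or an entity recognizer rather than fixed indices.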

Practice

(1/5)

1. What does tokenization do in spaCy?

easy

A. It splits text into smaller pieces called tokens.

B. It trains a machine learning model.

C. It translates text into another language.

D. It visualizes text data.

Solution

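Step 1: Understand tokenization concept
Tokenization breaks raw text into units such as words and punctuation marks.

Step 2: Relate to spaCy functionality
Calling a spaCy pipeline on a string returns a Doc whose items are those tokens.

Final Answer: A. It splits text into smaller pieces called tokens.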

Quick Check:

Solution

Step 1: Recall spaCy model loading syntax

Step 2: Identify correct model name and function

Final Answer:

Quick Check:

Solution

Step 1: Understand spaCy tokenization behavior

Step 2: Analyze the given text 'Hello, world!'

Final Answer:

Quick Check:

Solution

Step 1: Check Python syntax for loops

Step 2: Inspect the given code

Final Answer:

Quick Check:

Solution

Step 1: Understand spaCy's default tokenizer behavior

Step 2: Identify how to keep contractions as one token

Final Answer:

Quick Check: