
Lemmatization in spaCy in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this spaCy lemmatization code?
Given the following code snippet using spaCy, what will be the printed list of lemmas?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The striped bats are hanging on their feet for best')
lemmas = [token.lemma_ for token in doc]
print(lemmas)
A. ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'good']
B. ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'feet', 'for', 'best']
C. ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
D. ['the', 'striped', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
💡 Hint
Look at how spaCy converts plural nouns and verbs to their base forms.
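One way to check an answer like this is to print each token next to its part-of-speech tag and lemma, since spaCy v3's English lemmatizer is rule-based and depends on the predicted tag. The following is a minimal sketch: `lemma_rows` and `demo` are hypothetical helper names, and `demo` assumes spaCy and the `en_core_web_sm` model are installed. Exact lemmas can vary between spaCy and model versions, so no expected output is shown.

```python
def lemma_rows(doc):
    """Return (text, coarse POS tag, lemma) for every token in a spaCy Doc."""
    return [(token.text, token.pos_, token.lemma_) for token in doc]


def demo(text="The striped bats are hanging on their feet for best"):
    # Imported here so lemma_rows can be reused without loading spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: the model is installed
    for word, pos, lemma in lemma_rows(nlp(text)):
        # One row per token: original text, POS tag, lemma.
        print(f"{word:<10}{pos:<8}{lemma}")
```

Calling `demo()` prints one row per token, which makes it easy to see whether a surprising lemma comes from an unexpected POS tag.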
Model Choice
intermediate
Which spaCy model is best for accurate lemmatization?
You want to perform lemmatization on English text with good accuracy and speed. Which spaCy model should you choose?
A. en_core_web_sm (small model)
B. en_vectors_web_lg (only word vectors, no lemmatization)
C. en_core_web_lg (large model)
D. en_core_web_md (medium model)
💡 Hint
Larger models usually predict part-of-speech tags a little more accurately, and spaCy's rule-based English lemmatizer relies on those tags.
Metrics
advanced
Which metric best evaluates lemmatization quality?
You have a dataset with gold-standard lemmas and your spaCy model's predicted lemmas. Which metric best measures lemmatization accuracy?
A. Exact match accuracy
B. Recall
C. Precision
D. F1 score
💡 Hint
Lemmatization is about exact word form matches.
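Exact match accuracy can be computed in a few lines of plain Python. This is a minimal sketch: `lemma_accuracy` is a hypothetical helper name, and the gold and predicted lists are made-up illustration data, not results from a real model.

```python
def lemma_accuracy(gold, predicted):
    """Fraction of tokens whose predicted lemma equals the gold lemma exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token by token")
    if not gold:
        return 0.0  # convention chosen here for empty input
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up example: one of five predicted lemmas is wrong.
gold = ["the", "cat", "be", "run", "fast"]
predicted = ["the", "cat", "be", "running", "fast"]
print(lemma_accuracy(gold, predicted))  # 4 of 5 lemmas match -> 0.8
```

Each token has exactly one gold lemma and one predicted lemma, so a per-token comparison is enough; there is no separate notion of false positives and false negatives as in precision and recall.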
🔧 Debug
advanced
Why does this spaCy lemmatization code print numbers instead of lemmas?
Consider this code snippet:
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'Cats running fast'
doc = nlp(text)
lemmas = [token.lemma for token in doc]
print(lemmas)
Why does it print a list of large integers rather than lemma strings?
A. The 'doc' object is not iterable
B. The 'nlp' object is not callable because of missing parentheses
C. The 'text' variable is not defined before use
D. token.lemma is the integer hash ID of the lemma; the string form is token.lemma_
💡 Hint
Check the attribute name for token lemmas in spaCy.
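The difference between the two attributes can be checked directly. In spaCy, `token.lemma` is an integer hash ID and `token.lemma_` is the corresponding string; the vocabulary's string store maps one to the other. A minimal sketch, where `lemma_views` and `demo` are hypothetical helper names and `demo` assumes spaCy and `en_core_web_sm` are installed:

```python
def lemma_views(doc):
    """For each token, return (lemma hash ID, lemma string)."""
    return [(token.lemma, token.lemma_) for token in doc]


def demo(text="Cats running fast"):
    # Imported here so lemma_views can be reused without loading spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: the model is installed
    doc = nlp(text)
    for lemma_id, lemma_str in lemma_views(doc):
        # The StringStore resolves the integer hash back to its string.
        resolved = doc.vocab.strings[lemma_id]
        print(lemma_id, "->", resolved, "| lemma_ =", lemma_str)
```

The integer form is useful for fast comparisons inside spaCy; for printing, counting, or comparing against literals such as 'run', the underscore attribute is the one to use.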
🧠 Conceptual
expert
Why might spaCy lemmatization keep 'feet' as 'feet' instead of 'foot'?
Suppose spaCy lemmatizes the word 'feet' as 'feet' instead of the expected singular 'foot'. What is the most likely reason?
A. spaCy's lemmatizer uses a dictionary-based approach that sometimes keeps irregular plurals unchanged
B. The lemmatizer relies on part-of-speech tags; 'feet' is tagged as a plural noun, but the lemmatizer lacks irregular plural rules
C. The model's vocabulary does not include 'foot', so it cannot lemmatize 'feet' correctly
D. spaCy treats 'feet' as a plural noun but does not normalize irregular plurals to singular
💡 Hint
Think about how dictionary-based lemmatizers handle irregular forms.

Practice

1. What does lemmatization do in natural language processing using spaCy?
easy
A. It removes all punctuation from the text.
B. It counts the number of words in a sentence.
C. It finds the base or dictionary form of a word.
D. It translates text into another language.

Solution

  1. Step 1: Understand the purpose of lemmatization

    Lemmatization simplifies words by converting them to their base form, like 'running' to 'run'.
  2. Step 2: Compare options to definition

    Only option C, "It finds the base or dictionary form of a word," matches this definition.
  3. Final Answer:

    It finds the base or dictionary form of a word. -> Option C
  4. Quick Check:

    Lemmatization = base form extraction [OK]
Hint: Lemmatization = find base word form [OK]
Common Mistakes:
  • Confusing lemmatization with token counting
  • Thinking it translates text
  • Mixing it up with punctuation removal
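To make the idea of a "base or dictionary form" concrete, here is a toy lookup-table lemmatizer in plain Python. It is only a sketch: the table is hand-written for illustration, `toy_lemmatize` is a hypothetical name, and this is not how spaCy stores or applies its lemmatization data.

```python
# Hand-written toy table (hypothetical): inflected form -> dictionary form.
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "cats": "cat",
    "are": "be",
    "was": "be",
    "feet": "foot",
}


def toy_lemmatize(words):
    """Look each lowercased word up in LEMMA_TABLE; unknown words pass through."""
    return [LEMMA_TABLE.get(w.lower(), w.lower()) for w in words]


print(toy_lemmatize("The cats are running fast".split()))
# prints ['the', 'cat', 'be', 'run', 'fast']
```

A real lemmatizer also has to handle words that are not in any table and words whose lemma depends on their part of speech, which is why spaCy combines tagging with rules and exception lists.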
2. Which of the following is the correct way to get the lemma of a token in spaCy?
easy
A. token.lemma_
B. token.lemma
C. token.lemmatize()
D. token.get_lemma()

Solution

  1. Step 1: Recall spaCy token attribute for lemma

    spaCy uses the attribute lemma_ (with underscore) to get the lemma as a string.
  2. Step 2: Check each option

    token.lemma_ is the correct attribute. token.lemma exists but returns the lemma's integer hash ID rather than a string, and token.lemmatize() and token.get_lemma() are not part of spaCy's Token API.
  3. Final Answer:

    token.lemma_ -> Option A
  4. Quick Check:

    spaCy lemma attribute = token.lemma_ [OK]
Hint: Use token.lemma_ with underscore for lemma string [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Trying to call a method like lemmatize()
  • Using non-existent methods like get_lemma()
3. Given the code snippet:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cats are running fast')
lemmas = [token.lemma_ for token in doc]

What is the value of lemmas?
medium
A. ['the', 'cats', 'are', 'running', 'fast']
B. ['The', 'cats', 'are', 'running', 'fast']
C. ['The', 'cat', 'is', 'run', 'fast']
D. ['the', 'cat', 'be', 'run', 'fast']

Solution

  1. Step 1: Understand spaCy lemmatization output

    spaCy converts words to their base forms: 'cats' to 'cat', 'are' to 'be', 'running' to 'run', and lowercases 'The' to 'the'.
  2. Step 2: Match the list of lemmas

    Option D, ['the', 'cat', 'be', 'run', 'fast'], is the only list that matches these lemmas.
  3. Final Answer:

    ['the', 'cat', 'be', 'run', 'fast'] -> Option D
  4. Quick Check:

    spaCy lemma list = ['the', 'cat', 'be', 'run', 'fast'] [OK]
Hint: Lemmas are base forms, usually lowercase [OK]
Common Mistakes:
  • Expecting original words instead of lemmas
  • Not lowercasing lemmas
  • Confusing verb forms like 'are' with 'is'
4. Identify the error in this spaCy lemmatization code:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('She was eating apples')
lemmas = [token.lemma for token in doc]
print(lemmas)
medium
A. Missing parentheses in spacy.load()
B. Using token.lemma instead of token.lemma_
C. Incorrect model name in spacy.load()
D. Missing import for lemmatizer

Solution

  1. Step 1: Check spaCy lemma attribute usage

    spaCy tokens have lemma_ (with underscore) for lemma string, not lemma.
  2. Step 2: Identify the error in code

    The code uses token.lemma, which returns the lemma's integer hash ID rather than the lemma string, so the printed list contains numbers instead of words.
  3. Final Answer:

    Using token.lemma instead of token.lemma_ -> Option B
  4. Quick Check:

    Use token.lemma_ for lemma string [OK]
Hint: Remember underscore in token.lemma_ for lemma [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Assuming spacy.load() is missing parentheses
  • Thinking model name is wrong
5. You want to lemmatize a list of sentences and count how many times the lemma 'run' appears using spaCy. Which code snippet correctly does this?
hard
A.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma_ == 'run' for token in doc)
print(count)
B.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.text == 'run' for token in doc)
print(count)
C.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma == 'run' for token in doc)
print(count)
D.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma_ == 'running' for token in doc)
print(count)

Solution

  1. Step 1: Understand the goal and spaCy usage

    We want to count all tokens whose lemma is 'run', so we must use token.lemma_ and compare to 'run'.
  2. Step 2: Analyze each option

    Option A compares token.lemma_ to 'run', which is correct. Option B compares token.text to 'run', so it misses 'running'. Option C uses token.lemma, which is an integer hash ID and never equals the string 'run'. Option D compares the lemma to 'running', which is not the base form.
  3. Final Answer:

    sum(token.lemma_ == 'run' for token in doc) -> Option A
  4. Quick Check:

    Count lemma 'run' using token.lemma_ == 'run' [OK]
Hint: Compare token.lemma_ to base word for counting [OK]
Common Mistakes:
  • Comparing token.text instead of token.lemma_
  • Using token.lemma without underscore
  • Comparing lemma to non-base form like 'running'
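The same counting idea extends to every lemma at once with `collections.Counter` and spaCy's `nlp.pipe`, which processes texts as a stream. This is a sketch under assumptions: `count_lemmas` and `demo` are hypothetical helper names, and `demo` assumes spaCy and `en_core_web_sm` are installed.

```python
from collections import Counter


def count_lemmas(nlp, sentences):
    """Count lemma frequencies over sentences using a spaCy-style pipeline."""
    counts = Counter()
    # nlp.pipe yields one Doc per input text.
    for doc in nlp.pipe(sentences):
        counts.update(token.lemma_ for token in doc)
    return counts


def demo():
    # Imported here so count_lemmas can be reused without loading spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: the model is installed
    counts = count_lemmas(nlp, ["I run daily", "He is running fast"])
    print(counts["run"])  # number of tokens whose lemma is 'run'
```

Because a `Counter` returns 0 for missing keys, looking up a lemma that never occurs does not raise an error.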