Lemmatization helps find the base form of words. It makes text easier to analyze by treating different forms of a word as one.
Lemmatization in spaCy in NLP
Introduction
Syntax
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('running runs ran')
lemmas = [token.lemma_ for token in doc]
Use token.lemma_ to get the base form (lemma) of each word.
Make sure to load a spaCy language model like en_core_web_sm before lemmatization.
Examples
# Plural noun and verb forms
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('cats are running')
lemmas = [token.lemma_ for token in doc]
print(lemmas)

# Comparative and superlative adjectives
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('better best good')
lemmas = [token.lemma_ for token in doc]
print(lemmas)
Sample Model
This program loads spaCy's English model, processes a sentence, and prints the base forms of each word.
import spacy

# Load English model
nlp = spacy.load('en_core_web_sm')

# Text with different word forms
text = 'The children are playing and played in the playground.'
doc = nlp(text)

# Extract lemmas
lemmas = [token.lemma_ for token in doc]

print('Original text:', text)
print('Lemmatized tokens:', lemmas)
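One common use of the lemmas extracted above is frequency counting, where inflected forms such as 'playing' and 'played' should land in the same bucket. The following is a minimal sketch, not a definitive implementation. The helper name lemma_frequencies is hypothetical (it is not part of spaCy), and skipping punctuation and whitespace tokens is an assumption about what should be counted. nlp.pipe, token.is_punct, and token.is_space are standard spaCy APIs.

```python
from collections import Counter


def lemma_frequencies(nlp, texts):
    """Count lemmas across several texts.

    `nlp` is expected to be a loaded spaCy pipeline, for example
    spacy.load('en_core_web_sm'). The function name is illustrative only.
    """
    counts = Counter()
    # nlp.pipe processes the texts as a stream and yields one Doc per text.
    for doc in nlp.pipe(texts):
        counts.update(
            token.lemma_
            for token in doc
            # Skip punctuation and whitespace so only words are counted.
            if not (token.is_punct or token.is_space)
        )
    return counts
```

Calling lemma_frequencies(nlp, ['The children are playing.', 'They played yesterday.']) with a loaded English model should count 'playing' and 'played' under one key, provided the model tags both as verbs. Exact counts depend on the model version.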
Important Notes
Lemmatization depends on the word's context, so spaCy uses part-of-speech tags to get accurate lemmas.
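Words without inflected forms, such as 'the', keep their lowercase form as the lemma. This does not hold for every stop word: 'are' and 'was' are stop words, but their lemma is 'be'.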
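Because the lemma depends on the predicted part of speech, it can help to look at both side by side when a lemma seems wrong. The following is a small sketch under stated assumptions: lemma_table and format_lemma_table are hypothetical helpers (not spaCy functions) that only read the standard token.text, token.pos_, and token.lemma_ attributes.

```python
def lemma_table(doc):
    """Return (text, part-of-speech, lemma) triples for each token in a Doc."""
    return [(token.text, token.pos_, token.lemma_) for token in doc]


def format_lemma_table(rows):
    """Format the triples as aligned text columns for printing."""
    header = ("TEXT", "POS", "LEMMA")
    lines = []
    for text, pos, lemma in [header, *rows]:
        # Left-align the first two columns to fixed widths.
        lines.append(f"{text:<12}{pos:<8}{lemma}")
    return "\n".join(lines)
```

With doc = nlp('I am meeting her at the meeting'), printing format_lemma_table(lemma_table(doc)) shows whether each occurrence of 'meeting' was tagged as a verb or a noun, which in turn affects its lemma. The exact tags and lemmas depend on the model version.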
Summary
Lemmatization finds the base form of words to simplify text analysis.
Use token.lemma_ in spaCy after loading a language model.
It helps treat different word forms as the same word for better understanding.
Practice
1. What does lemmatization do in natural language processing using spaCy?
easy
Solution
Step 1: Understand the purpose of lemmatization
Lemmatization simplifies words by converting them to their base form, like 'running' to 'run'.
Step 2: Compare options to definition
Only the option "It finds the base or dictionary form of a word." matches this definition.
Final Answer:
It finds the base or dictionary form of a word. -> Option C
Quick Check:
Lemmatization = base form extraction [OK]
Hint: Lemmatization = find base word form [OK]
Common Mistakes:
- Confusing lemmatization with token counting
- Thinking it translates text
- Mixing it up with punctuation removal
2. Which of the following is the correct way to get the lemma of a token in spaCy?
easy
Solution
Step 1: Recall spaCy token attribute for lemma
spaCy uses the attribute lemma_ (with underscore) to get the lemma as a string.
Step 2: Check each option
token.lemma_ matches the correct attribute. token.lemma exists but returns the lemma's integer hash ID rather than a string; token.lemmatize() and token.get_lemma() are not part of spaCy's Token API.
Final Answer:
token.lemma_ -> Option A
Quick Check:
spaCy lemma attribute = token.lemma_ [OK]
Hint: Use token.lemma_ with underscore for lemma string [OK]
Common Mistakes:
- Using token.lemma without underscore
- Trying to call a method like lemmatize()
- Using non-existent methods like get_lemma()
3. Given the code snippet:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cats are running fast')
lemmas = [token.lemma_ for token in doc]
What is the value of lemmas?
medium
Solution
Step 1: Understand spaCy lemmatization output
spaCy converts words to their base forms: 'cats' to 'cat', 'are' to 'be', 'running' to 'run', and lowercases 'The' to 'the'.
Step 2: Match the list of lemmas
Only the list ['the', 'cat', 'be', 'run', 'fast'] contains all of these lemmas.
Final Answer:
['the', 'cat', 'be', 'run', 'fast'] -> Option D
Quick Check:
spaCy lemma list = ['the', 'cat', 'be', 'run', 'fast'] [OK]
Hint: Lemmas are base forms, usually lowercase [OK]
Common Mistakes:
- Expecting original words instead of lemmas
- Not lowercasing lemmas
- Expecting 'are' to lemmatize to 'is' instead of 'be'
4. Identify the error in this spaCy lemmatization code:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('She was eating apples')
lemmas = [token.lemma for token in doc]
print(lemmas)
medium
Solution
Step 1: Check spaCy lemma attribute usage
spaCy tokens have lemma_ (with underscore) for the lemma string, not lemma.
Step 2: Identify the error in code
The code uses token.lemma, which returns the lemma's integer hash ID rather than the lemma string, so the printed list contains large integers instead of words.
Final Answer:
Using token.lemma instead of token.lemma_ -> Option B
Quick Check:
Use token.lemma_ for lemma string [OK]
Hint: Remember underscore in token.lemma_ for lemma [OK]
Common Mistakes:
- Using token.lemma without underscore
- Assuming the spacy.load call is missing parentheses
- Thinking model name is wrong
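The difference between the two attributes can be checked directly: token.lemma holds an integer ID, while token.lemma_ holds the text. The following is a sketch, assuming a hypothetical helper describe_lemma (not part of spaCy). vocab.strings is spaCy's string store, which maps IDs back to strings.

```python
def describe_lemma(token, vocab):
    """Return the integer lemma ID, the lemma string, and the ID resolved to text.

    `token` is a spaCy Token and `vocab` is the pipeline's vocabulary
    (nlp.vocab). The helper name is illustrative only.
    """
    lemma_id = token.lemma              # integer hash, not human-readable
    lemma_text = token.lemma_           # the lemma as a string
    resolved = vocab.strings[lemma_id]  # look the ID up in the string store
    return lemma_id, lemma_text, resolved
```

For a token from a loaded pipeline, describe_lemma(token, nlp.vocab) should return the same text in its second and third positions. The first value is an integer, which is why comparing token.lemma to a string such as 'run' never matches.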
5. You want to lemmatize a list of sentences and count how many times the lemma 'run' appears using spaCy. Which code snippet correctly does this?
hard
Solution
Step 1: Understand the goal and spaCy usage
We want to count all tokens whose lemma is 'run', so we must use token.lemma_ and compare it to 'run'.
Step 2: Analyze each option
All four options share the same setup and differ only in the comparison inside sum(...):
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma_ == 'run' for token in doc)
print(count)
- token.lemma_ == 'run' is correct: it compares the lemma string to the base form.
- token.text == 'run' compares the original text, so it misses 'running'.
- token.lemma == 'run' compares an integer hash ID to a string, so it never matches.
- token.lemma_ == 'running' compares the lemma to an inflected form, which is not the base form.
Final Answer:
sum(token.lemma_ == 'run' for token in doc) -> Option A
Quick Check:
Count lemma 'run' using token.lemma_ == 'run' [OK]
Hint: Compare token.lemma_ to base word for counting [OK]
Common Mistakes:
- Comparing token.text instead of token.lemma_
- Using token.lemma without underscore
- Comparing lemma to non-base form like 'running'
