Lemmatization is about finding the base form of words. The key metric here is Accuracy, which measures how many words are correctly converted to their base forms out of all words processed. This matters because correct base forms help many language tasks like search, translation, and understanding.
Lemmatization in spaCy in NLP - Model Metrics & Evaluation
For lemmatization, a confusion matrix can show how many words were correctly lemmatized (True Positives) versus incorrectly lemmatized (False Positives and False Negatives). For example:
| | Predicted Correct | Predicted Incorrect |
|------|-------------------|-------------------|
| Actual Correct | TP=85 | FN=15 |
| Actual Incorrect | FP=10 | TN=90 |
Here, TP means words correctly lemmatized, FP means words wrongly lemmatized as correct, FN means words missed, and TN means words correctly identified as not needing change.
Precision tells us how many of the words we labeled as correct base forms really are correct. Recall tells us how many of the actual base forms we found.
For example, if we want to avoid wrong base forms (high precision), we might miss some correct ones (lower recall). If we want to find all base forms (high recall), we might include some wrong ones (lower precision).
In lemmatization, usually high precision is preferred to avoid confusing the meaning, but recall should not be too low to keep usefulness.
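As a worked example, the counts from the confusion matrix above (TP=85, FN=15, FP=10, TN=90) can be turned into accuracy, precision, and recall. This is a minimal sketch; the function name `summarize` is illustrative and not part of any library.

```python
def summarize(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Return accuracy, precision and recall for one confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # all correct decisions / all decisions
        "precision": tp / (tp + fp),     # of those predicted correct, how many were
        "recall": tp / (tp + fn),        # of those actually correct, how many were found
    }


metrics = summarize(tp=85, fn=15, fp=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.875
# precision: 0.895
# recall: 0.850
```

With these counts, precision is slightly higher than recall, which matches the preference stated above.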
Good: Accuracy above 90%, Precision and Recall balanced above 85%. This means most words are correctly lemmatized and few mistakes happen.
Bad: Accuracy below 70%, Precision or Recall very low (below 50%). This means many words are wrongly lemmatized or many base forms are missed, hurting downstream tasks.
- Ignoring context: Some words need sentence context to lemmatize correctly. Metrics may look good on simple words but fail on complex sentences.
- Data leakage: Testing on words seen during training inflates accuracy.
- Overfitting: Model memorizes common words but fails on new words, causing poor real-world performance.
- Accuracy paradox: High accuracy can happen if many words don't need lemmatization, hiding poor performance on actual changes.
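The accuracy paradox in the last item can be made visible by scoring the same predictions twice: once over all tokens, and once over only the tokens whose gold lemma differs from the surface form. The sketch below uses a small hand-made sample; the data and the helper name `accuracy` are assumptions for illustration, not output from a real model.

```python
# Hypothetical evaluation rows: (surface form, gold lemma, predicted lemma).
samples = [
    ("the", "the", "the"),
    ("cat", "cat", "cat"),
    ("sat", "sit", "sat"),      # wrong: left unchanged
    ("on", "on", "on"),
    ("a", "a", "a"),
    ("mat", "mat", "mat"),
    ("and", "and", "and"),
    ("mice", "mouse", "mice"),  # wrong: left unchanged
    ("ran", "run", "run"),
    ("away", "away", "away"),
]


def accuracy(rows):
    """Fraction of rows whose predicted lemma equals the gold lemma."""
    rows = list(rows)
    return sum(gold == pred for _, gold, pred in rows) / len(rows)


overall = accuracy(samples)
# Keep only tokens that actually need to change (surface form != gold lemma).
changed_only = accuracy(row for row in samples if row[0] != row[1])

print(f"overall accuracy:      {overall:.2f}")       # 0.80
print(f"accuracy on changes:   {changed_only:.2f}")  # 0.33
```

Reporting both numbers side by side keeps easy, unchanged tokens from hiding errors on the words that lemmatization is meant to handle.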
Your lemmatization model has 98% accuracy but only 12% recall on rare verb forms. Is it good for production? Why or why not?
Answer: No, it is not good. The high accuracy likely comes from many words that don't change, but the very low recall on rare verbs means the model misses most of these important cases. This hurts tasks relying on correct base forms of verbs.
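The arithmetic behind this answer can be checked with made-up counts. The numbers below are assumptions chosen only so that they reproduce the 98% and 12% figures from the question, and "recall on rare verb forms" is taken here to mean the fraction of rare verb forms that were lemmatized correctly.

```python
# Hypothetical test set: 5000 tokens, of which 100 are rare verb forms.
total_tokens = 5000
rare_verbs = 100
other_tokens = total_tokens - rare_verbs  # 4900

# Hypothetical results: almost everything outside the rare verbs is right.
rare_correct = 12
other_correct = 4888

overall_accuracy = (rare_correct + other_correct) / total_tokens
rare_recall = rare_correct / rare_verbs

print(f"overall accuracy: {overall_accuracy:.0%}")  # 98%
print(f"rare-verb recall: {rare_recall:.0%}")       # 12%
```

Because rare verb forms make up only 2% of the tokens in this sketch, getting 88 of them wrong barely moves the overall accuracy.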
Practice
Solution
Step 1: Understand the purpose of lemmatization
Lemmatization simplifies words by converting them to their base form, like 'running' to 'run'.
Step 2: Compare options to definition
Only "It finds the base or dictionary form of a word." correctly describes what lemmatization does.
Final Answer:
It finds the base or dictionary form of a word. -> Option C
Quick Check:
Lemmatization = base form extraction [OK]
- Confusing lemmatization with token counting
- Thinking it translates text
- Mixing it up with punctuation removal
Solution
Step 1: Recall spaCy token attribute for lemma
spaCy uses the attribute lemma_ (with underscore) to get the lemma as a string.
Step 2: Check each option
token.lemma_ matches the correct attribute. token.lemma exists but returns an integer hash ID rather than a string, and token.lemmatize() and token.get_lemma() are not part of spaCy's Token API.
Final Answer:
token.lemma_ -> Option A
Quick Check:
spaCy lemma attribute = token.lemma_ [OK]
- Using token.lemma without underscore
- Trying to call a method like lemmatize()
- Using non-existent methods like get_lemma()
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cats are running fast')
lemmas = [token.lemma_ for token in doc]
What is the value of lemmas?
Solution
Step 1: Understand spaCy lemmatization output
spaCy converts words to their base forms: 'cats' to 'cat', 'are' to 'be', 'running' to 'run', and lowercases 'The' to 'the'. 'fast' is already a base form and stays unchanged.
Step 2: Match the list of lemmas
Putting the lemmas together in token order gives ['the', 'cat', 'be', 'run', 'fast'].
Final Answer:
['the', 'cat', 'be', 'run', 'fast'] -> Option D
Quick Check:
spaCy lemma list = ['the', 'cat', 'be', 'run', 'fast'] [OK]
- Expecting original words instead of lemmas
- Not lowercasing lemmas
- Confusing verb forms like 'are' with 'is'
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('She was eating apples')
lemmas = [token.lemma for token in doc]
print(lemmas)
Solution
Step 1: Check spaCy lemma attribute usage
spaCy tokens have lemma_ (with underscore) for the lemma string, not lemma.
Step 2: Identify the error in code
The code uses token.lemma, which returns the lemma's integer hash ID rather than the lemma string, so the printed list contains large integers instead of words.
Final Answer:
Using token.lemma instead of token.lemma_ -> Option B
Quick Check:
Use token.lemma_ for lemma string [OK]
- Using token.lemma without underscore
- Assuming the spacy.load call is missing parentheses
- Thinking model name is wrong
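The difference between the two attributes can be seen without downloading a model. The sketch below builds a Doc by hand and supplies the lemmas manually, which is an assumption made for illustration (a real pipeline such as en_core_web_sm would predict them); it relies on the `lemmas` argument of the `Doc` constructor available in spaCy v3.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a Doc by hand so the example needs no downloaded model.
# The lemmas are supplied manually here, not predicted by a lemmatizer.
vocab = Vocab()
doc = Doc(
    vocab,
    words=["She", "was", "eating", "apples"],
    lemmas=["she", "be", "eat", "apple"],
)

ids = [token.lemma for token in doc]       # integer hash IDs
strings = [token.lemma_ for token in doc]  # human-readable lemma strings

print(strings)                                     # ['she', 'be', 'eat', 'apple']
print(all(isinstance(i, int) for i in ids))        # True
print([vocab.strings[i] for i in ids] == strings)  # True: IDs map back to strings
```

The integer IDs are useful for fast comparisons inside spaCy, while the underscore attribute is the one to print or compare against string literals.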
Solution
Step 1: Understand the goal and spaCy usage
We want to count all tokens whose lemma is 'run', so we must use token.lemma_ and compare it to 'run'.
Step 2: Analyze each option
All four options share the same program and differ only in the expression inside sum(). The correct version is:

    import spacy
    nlp = spacy.load('en_core_web_sm')
    sentences = ['I run daily', 'He is running fast']
    count = 0
    for sent in sentences:
        doc = nlp(sent)
        count += sum(token.lemma_ == 'run' for token in doc)
    print(count)

- token.lemma_ == 'run' (shown above) correctly compares the lemma string to 'run'.
- token.text == 'run' compares the original text, so it misses 'running'.
- token.lemma == 'run' uses token.lemma without the underscore, which compares an integer ID to a string and never matches.
- token.lemma_ == 'running' compares the lemma to 'running', which is not the base form.
Final Answer:
sum(token.lemma_ == 'run' for token in doc) -> Option A
Quick Check:
Count lemma 'run' using token.lemma_ == 'run' [OK]
- Comparing token.text instead of token.lemma_
- Using token.lemma without underscore
- Comparing lemma to non-base form like 'running'
