
Lemmatization in spaCy in NLP - Model Metrics & Evaluation

Metrics & Evaluation - Lemmatization in spaCy
Which metric matters for Lemmatization in spaCy and WHY

Lemmatization maps each word to its base (dictionary) form. The key metric is accuracy: the share of tokens whose predicted lemma exactly matches the reference lemma, out of all tokens processed. This matters because correct base forms feed many downstream language tasks, such as search, translation, and text understanding.
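
As a minimal sketch of that definition, the snippet below computes exact-match accuracy over aligned lists of lemmas. The function name lemma_accuracy and the sample lemmas are illustrative assumptions, not part of spaCy.

```python
def lemma_accuracy(predicted, gold):
    """Fraction of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical gold and predicted lemmas for "The cats were running".
gold = ["the", "cat", "be", "run"]
predicted = ["the", "cat", "be", "running"]  # one error, on "running"

accuracy = lemma_accuracy(predicted, gold)
print(f"accuracy = {accuracy:.2f}")  # 3 of 4 lemmas match
```

In practice the predicted list would come from token.lemma_ over a processed document, and the gold list from an annotated corpus with the same tokenization.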

Confusion matrix for Lemmatization

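Lemmatization is not naturally a yes/no task, so a confusion matrix needs a framing. One workable choice is to treat "this word needs to change to reach its base form" as the positive class and compare that with what the model did. For example:

                          | Model changed the word | Model left the word unchanged
    ----------------------|------------------------|------------------------------
    Needs a change        |        TP=85           |        FN=15
    Already a base form   |        FP=10           |        TN=90

Here, TP counts words that needed a change and were lemmatized, FN counts words that needed a change but were missed, FP counts words that were already base forms but were altered anyway, and TN counts words correctly identified as not needing change. This framing does not separate a word changed to the right lemma from one changed to a wrong lemma; exact-match accuracy covers that case.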
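
To make the example counts concrete, here is a small sketch that derives the usual metrics from TP=85, FN=15, FP=10, and TN=90. It uses the standard formulas only; nothing here is specific to spaCy.

```python
# Counts taken from the example confusion matrix.
tp, fn, fp, tn = 85, 15, 10, 90

precision = tp / (tp + fp)                    # 85 / 95
recall = tp / (tp + fn)                       # 85 / 100
accuracy = (tp + tn) / (tp + fn + fp + tn)    # 175 / 200
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```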

Tradeoff: Precision vs Recall in Lemmatization

Precision tells us how many of the words we labeled as correct base forms really are correct. Recall tells us how many of the actual base forms we found.

For example, if we want to avoid wrong base forms (high precision), we might miss some correct ones (lower recall). If we want to find all base forms (high recall), we might include some wrong ones (lower precision).

In lemmatization, high precision is usually preferred, because a wrong base form can change the meaning of the text. Recall still has to stay high enough for the lemmatizer to be useful.

Good vs Bad metric values for Lemmatization

Good (as a rough rule of thumb): accuracy above 90%, with precision and recall balanced and both above 85%. This means most words are correctly lemmatized and mistakes are rare.

Bad: accuracy below 70%, or precision or recall below 50%. This means many words are wrongly lemmatized or many base forms are missed, which hurts downstream tasks.

Common pitfalls in Lemmatization metrics
  • Ignoring context: Some words need sentence context to lemmatize correctly. Metrics may look good on simple words but fail on complex sentences.
  • Data leakage: Testing on words seen during training inflates accuracy.
  • Overfitting: Model memorizes common words but fails on new words, causing poor real-world performance.
  • Accuracy paradox: High accuracy can happen if many words don't need lemmatization, hiding poor performance on actual changes.
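
The accuracy paradox can be illustrated with a do-nothing baseline. The sketch below uses made-up (token, gold lemma) pairs and a hypothetical identity_lemmatizer that returns the lowercased token, then compares overall accuracy with accuracy on only the tokens that actually need to change.

```python
# Hypothetical (token, gold lemma) pairs; most tokens are already base forms.
pairs = [
    ("the", "the"), ("dog", "dog"), ("and", "and"), ("cat", "cat"),
    ("in", "in"), ("a", "a"), ("park", "park"), ("on", "on"),
    ("ran", "run"), ("mice", "mouse"),
]


def identity_lemmatizer(token):
    """Baseline that 'lemmatizes' by returning the lowercased token as is."""
    return token.lower()


predictions = [identity_lemmatizer(tok) for tok, _ in pairs]
golds = [gold for _, gold in pairs]

# Accuracy over every token.
overall = sum(p == g for p, g in zip(predictions, golds)) / len(pairs)

# Accuracy over tokens whose gold lemma differs from the surface form.
changed = [(p, g) for (tok, g), p in zip(pairs, predictions) if tok.lower() != g]
changed_acc = sum(p == g for p, g in changed) / len(changed)

print(f"overall accuracy: {overall:.2f}")
print(f"accuracy on changed tokens: {changed_acc:.2f}")
```

A baseline that never changes anything still scores well overall on this data while getting every inflected form wrong, which is why the second number is worth reporting alongside the first.
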
Self-check question

Your lemmatization model has 98% accuracy but only 12% recall on rare verb forms. Is it good for production? Why or why not?

Answer: No, it is not good. The high accuracy likely comes from many words that don't change, but the very low recall on rare verbs means the model misses most of these important cases. This hurts tasks relying on correct base forms of verbs.
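
One way to catch this situation is to report scores per slice rather than a single overall number. The sketch below assumes each evaluation record carries a slice label; the labels, records, and the accuracy_by_slice helper are all hypothetical, and exact-match accuracy within each slice stands in for the recall figure in the question.

```python
from collections import defaultdict

# Hypothetical evaluation records: (slice label, gold lemma, predicted lemma).
records = [
    ("common", "be", "be"),
    ("common", "the", "the"),
    ("common", "cat", "cat"),
    ("common", "run", "run"),
    ("rare_verb", "smite", "smote"),
    ("rare_verb", "slay", "slew"),
    ("rare_verb", "tread", "tread"),
]


def accuracy_by_slice(records):
    """Return exact-match accuracy for each slice label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for label, gold, pred in records:
        totals[label] += 1
        correct[label] += int(gold == pred)
    return {label: correct[label] / totals[label] for label in totals}


scores = accuracy_by_slice(records)
for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```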

Key Result
Accuracy is key for lemmatization, but balanced precision and recall ensure correct and complete base form detection.

Practice

1. What does lemmatization do in natural language processing using spaCy?
easy
A. It removes all punctuation from the text.
B. It counts the number of words in a sentence.
C. It finds the base or dictionary form of a word.
D. It translates text into another language.

Solution

  1. Step 1: Understand the purpose of lemmatization

    Lemmatization simplifies words by converting them to their base form, like 'running' to 'run'.
  2. Step 2: Compare options to definition

    Only option C, "It finds the base or dictionary form of a word," matches this definition. The other options describe punctuation removal, word counting, and translation.
  3. Final Answer:

    It finds the base or dictionary form of a word. -> Option C
  4. Quick Check:

    Lemmatization = base form extraction [OK]
Hint: Lemmatization = find base word form [OK]
Common Mistakes:
  • Confusing lemmatization with token counting
  • Thinking it translates text
  • Mixing it up with punctuation removal
2. Which of the following is the correct way to get the lemma of a token in spaCy?
easy
A. token.lemma_
B. token.lemma
C. token.lemmatize()
D. token.get_lemma()

Solution

  1. Step 1: Recall spaCy token attribute for lemma

    spaCy uses the attribute lemma_ (with underscore) to get the lemma as a string.
  2. Step 2: Check each option

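    token.lemma_ returns the lemma as a string. token.lemma (without underscore) also exists, but it returns the integer hash ID of the lemma, not the text. token.lemmatize() and token.get_lemma() are not methods of spaCy tokens.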
  3. Final Answer:

    token.lemma_ -> Option A
  4. Quick Check:

    spaCy lemma attribute = token.lemma_ [OK]
Hint: Use token.lemma_ with underscore for lemma string [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Trying to call a method like lemmatize()
  • Using non-existent methods like get_lemma()
3. Given the code snippet:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cats are running fast')
lemmas = [token.lemma_ for token in doc]

What is the value of lemmas?
medium
A. ['the', 'cats', 'are', 'running', 'fast']
B. ['The', 'cats', 'are', 'running', 'fast']
C. ['The', 'cat', 'is', 'run', 'fast']
D. ['the', 'cat', 'be', 'run', 'fast']

Solution

  1. Step 1: Understand spaCy lemmatization output

    spaCy converts words to their base forms: 'cats' to 'cat', 'are' to 'be', 'running' to 'run', and lowercases 'The' to 'the'.
  2. Step 2: Match the list of lemmas

    Only option D lists these lemmas: ['the', 'cat', 'be', 'run', 'fast']. Exact output can vary slightly between spaCy and model versions.
  3. Final Answer:

    ['the', 'cat', 'be', 'run', 'fast'] -> Option D
  4. Quick Check:

    spaCy lemma list = ['the', 'cat', 'be', 'run', 'fast'] [OK]
Hint: Lemmas are base forms, usually lowercase [OK]
Common Mistakes:
  • Expecting original words instead of lemmas
  • Not lowercasing lemmas
  • Confusing verb forms like 'are' with 'is'
4. Identify the error in this spaCy lemmatization code:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('She was eating apples')
lemmas = [token.lemma for token in doc]
print(lemmas)
medium
A. Missing parentheses in spacy.load()
B. Using token.lemma instead of token.lemma_
C. Incorrect model name in spacy.load()
D. Missing import for lemmatizer

Solution

  1. Step 1: Check spaCy lemma attribute usage

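    spaCy tokens expose the lemma string as lemma_ (with underscore). The attribute lemma (without underscore) is the integer hash ID of that string.
  2. Step 2: Identify the error in code

    The code uses token.lemma, so it prints a list of large integers instead of the lemma strings. It runs without raising an exception, but the output is not the intended one.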
  3. Final Answer:

    Using token.lemma instead of token.lemma_ -> Option B
  4. Quick Check:

    Use token.lemma_ for lemma string [OK]
Hint: Remember underscore in token.lemma_ for lemma [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Assuming the spacy.load() call is missing parentheses
  • Thinking model name is wrong
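
The difference between the two attributes can be checked directly. This is a small sketch that assumes spaCy is installed; it uses spacy.blank("en") so that no trained model has to be downloaded, and because a blank pipeline has no lemmatizer, one lemma is set by hand for illustration only.

```python
import spacy

# A blank English pipeline only tokenizes; no model download is required.
nlp = spacy.blank("en")
doc = nlp("She was eating apples")

# No lemmatizer is present, so assign one lemma manually for the demo.
token = doc[2]  # "eating"
token.lemma_ = "eat"

lemma_text = token.lemma_  # the lemma as a string
lemma_id = token.lemma     # the same lemma as an integer hash ID

print(lemma_text)
print(type(lemma_id).__name__)
# The hash maps back to the string through the shared vocabulary.
print(nlp.vocab.strings[lemma_id])
```

With a trained pipeline such as en_core_web_sm, the lemmas are filled in automatically and the same relationship between lemma and lemma_ holds.
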
5. You want to lemmatize a list of sentences and count how many times the lemma 'run' appears using spaCy. Which code snippet correctly does this?
hard
A.
    import spacy
    nlp = spacy.load('en_core_web_sm')
    sentences = ['I run daily', 'He is running fast']
    count = 0
    for sent in sentences:
        doc = nlp(sent)
        count += sum(token.lemma_ == 'run' for token in doc)
    print(count)
B.
    import spacy
    nlp = spacy.load('en_core_web_sm')
    sentences = ['I run daily', 'He is running fast']
    count = 0
    for sent in sentences:
        doc = nlp(sent)
        count += sum(token.text == 'run' for token in doc)
    print(count)
C.
    import spacy
    nlp = spacy.load('en_core_web_sm')
    sentences = ['I run daily', 'He is running fast']
    count = 0
    for sent in sentences:
        doc = nlp(sent)
        count += sum(token.lemma == 'run' for token in doc)
    print(count)
D.
    import spacy
    nlp = spacy.load('en_core_web_sm')
    sentences = ['I run daily', 'He is running fast']
    count = 0
    for sent in sentences:
        doc = nlp(sent)
        count += sum(token.lemma_ == 'running' for token in doc)
    print(count)

Solution

  1. Step 1: Understand the goal and spaCy usage

    We want to count all tokens whose lemma is 'run', so we must use token.lemma_ and compare to 'run'.
  2. Step 2: Analyze each option

    Option A correctly uses token.lemma_ == 'run', so it counts both 'run' and 'running'. Option B compares the original text (token.text), so it misses 'running'. Option C uses token.lemma without the underscore; that is an integer hash ID and never equals the string 'run', so the count stays at 0. Option D compares the lemma to 'running', which is not the base form.
  3. Final Answer:

    sum(token.lemma_ == 'run' for token in doc) -> Option A
  4. Quick Check:

    Count lemma 'run' using token.lemma_ == 'run' [OK]
Hint: Compare token.lemma_ to base word for counting [OK]
Common Mistakes:
  • Comparing token.text instead of token.lemma_
  • Using token.lemma without underscore
  • Comparing lemma to non-base form like 'running'