Lemmatization in spaCy in NLP - Model Pipeline Trace

Model Pipeline - Lemmatization in spaCy

This pipeline shows how spaCy processes text to find the base form of words, called lemmas. It starts with raw text, breaks it into words, and then finds each word's lemma to help understand the meaning better.

Data Flow - 4 Stages

  1. Raw Text Input
     Input: 1 sentence (string)
     Operation: Input raw sentence as text
     Output: 1 sentence (string)
     Example: "The cats are running quickly."
  2. Tokenization
     Input: 1 sentence (string)
     Operation: Split sentence into words (tokens)
     Output: 6 tokens (words)
     Example: ["The", "cats", "are", "running", "quickly", "."]
  3. Part-of-Speech Tagging
     Input: 6 tokens
     Operation: Assign word types (noun, verb, etc.)
     Output: 6 tokens with POS tags
     Example: [('The', 'DET'), ('cats', 'NOUN'), ('are', 'AUX'), ('running', 'VERB'), ('quickly', 'ADV'), ('.', 'PUNCT')]
  4. Lemmatization
     Input: 6 tokens with POS tags
     Operation: Find base form (lemma) of each token
     Output: 6 lemmas
     Example: ["the", "cat", "be", "run", "quickly", "."]

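The four stages above can be sketched in plain Python. This is only a toy illustration of the data flow, not how spaCy works internally: the regex tokenizer, the TOY_POS table, and the TOY_LEMMAS table are all hypothetical, hand-written stand-ins for spaCy's tokenizer, tagger, and lemmatizer.

```python
import re

# Hypothetical hand-written tables standing in for a trained tagger
# and a lemma lookup. Real pipelines learn or ship these.
TOY_POS = {
    "the": "DET", "cats": "NOUN", "are": "AUX",
    "running": "VERB", "quickly": "ADV", ".": "PUNCT",
}
TOY_LEMMAS = {
    ("cats", "NOUN"): "cat",
    ("are", "AUX"): "be",
    ("running", "VERB"): "run",
    ("meeting", "VERB"): "meet",  # as a NOUN, "meeting" stays "meeting"
}


def tokenize(text):
    """Stage 2: split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Stage 3: attach a POS tag to each token ("X" if unknown)."""
    return [(tok, TOY_POS.get(tok.lower(), "X")) for tok in tokens]


def lemmatize(tagged):
    """Stage 4: look up (word, POS); fall back to the lowercased word."""
    return [TOY_LEMMAS.get((tok.lower(), pos), tok.lower()) for tok, pos in tagged]


tokens = tokenize("The cats are running quickly.")
tagged = tag(tokens)
lemmas = lemmatize(tagged)
print(tokens)
print(tagged)
print(lemmas)
```

Because the lookup key includes the POS tag, the same word can receive different lemmas: in this toy table, lemmatize([("meeting", "NOUN"), ("meeting", "VERB")]) gives "meeting" for the noun and "meet" for the verb.
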
Training Trace - Epoch by Epoch
[Plot: loss (y-axis, 0.1 to 0.5) versus epoch (x-axis, 1 to 5)]

Epoch  Loss ↓  Accuracy ↑  Observation
1      0.45    0.70        Initial training with moderate loss and accuracy.
2      0.30    0.82        Loss decreased and accuracy improved as model learned.
3      0.20    0.90        Model shows good convergence with higher accuracy.
4      0.15    0.93        Further improvement, loss lowering steadily.
5      0.12    0.95        Training converged with high accuracy and low loss.
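
The source does not say how the accuracy column is computed. A common choice, assumed here, is token-level accuracy: the fraction of tokens whose predicted lemma equals the reference (gold) lemma. The sketch below shows that calculation; the function name lemma_accuracy and the example predictions are made up for illustration and are not the numbers in the table.

```python
def lemma_accuracy(predicted, gold):
    """Fraction of positions where the predicted lemma matches the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


gold = ["the", "cat", "be", "run", "quickly", "."]

# Hypothetical early-epoch output: two tokens are left unlemmatized.
early_prediction = ["the", "cats", "be", "running", "quickly", "."]

print(round(lemma_accuracy(early_prediction, gold), 2))  # 4 of 6 correct
print(lemma_accuracy(gold, gold))                        # all correct
```
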
Prediction Trace - 3 Layers
Layer 1: Tokenization
Layer 2: POS Tagging
Layer 3: Lemmatization
Model Quiz - 3 Questions
Test your understanding
What is the main purpose of lemmatization in spaCy?
A. To find the base form of words
B. To split sentences into words
C. To assign part-of-speech tags
D. To translate text into another language
Key Insight
Lemmatization helps reduce different word forms to a common base, improving text understanding. The POS tags guide the model to choose the correct lemma. Training shows steady improvement, meaning the model learns to lemmatize accurately.

Practice

1. What does lemmatization do in natural language processing using spaCy?
easy
A. It removes all punctuation from the text.
B. It counts the number of words in a sentence.
C. It finds the base or dictionary form of a word.
D. It translates text into another language.

Solution

  1. Step 1: Understand the purpose of lemmatization

    Lemmatization simplifies words by converting them to their base form, like 'running' to 'run'.
  2. Step 2: Compare options to definition

    Only option C, "It finds the base or dictionary form of a word," matches this definition.
  3. Final Answer:

    It finds the base or dictionary form of a word. -> Option C
  4. Quick Check:

    Lemmatization = base form extraction [OK]
Hint: Lemmatization = find base word form [OK]
Common Mistakes:
  • Confusing lemmatization with token counting
  • Thinking it translates text
  • Mixing it up with punctuation removal
2. Which of the following is the correct way to get the lemma of a token in spaCy?
easy
A. token.lemma_
B. token.lemma
C. token.lemmatize()
D. token.get_lemma()

Solution

  1. Step 1: Recall spaCy token attribute for lemma

    spaCy uses the attribute lemma_ (with underscore) to get the lemma as a string.
  2. Step 2: Check each option

    token.lemma_ matches the correct attribute. token.lemma (no underscore) exists, but it returns the lemma's integer hash ID rather than a string. token.lemmatize() and token.get_lemma() are not part of spaCy's Token API.
  3. Final Answer:

    token.lemma_ -> Option A
  4. Quick Check:

    spaCy lemma attribute = token.lemma_ [OK]
Hint: Use token.lemma_ with underscore for lemma string [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Trying to call a method like lemmatize()
  • Using non-existent methods like get_lemma()
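
The difference between token.lemma_ and token.lemma can be seen in a short sketch. To avoid needing a downloaded model, it assumes a blank English pipeline and a Doc built by hand, with the lemma supplied manually as a stand-in for what a lemmatizer would assign.

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline: tokenizer and vocabulary only, no trained components.
nlp = spacy.blank("en")

# Hand-built Doc; the lemma is supplied manually for illustration.
doc = Doc(nlp.vocab, words=["running"], lemmas=["run"])
token = doc[0]

lemma_text = token.lemma_  # the lemma as a string
lemma_id = token.lemma     # the same lemma as an integer hash ID

print(type(lemma_text).__name__, lemma_text)
print(type(lemma_id).__name__)

# The vocabulary's string store maps the hash back to the string.
print(nlp.vocab.strings[lemma_id])
```
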
3. Given the code snippet:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cats are running fast')
lemmas = [token.lemma_ for token in doc]

What is the value of lemmas?
medium
A. ['the', 'cats', 'are', 'running', 'fast']
B. ['The', 'cats', 'are', 'running', 'fast']
C. ['The', 'cat', 'is', 'run', 'fast']
D. ['the', 'cat', 'be', 'run', 'fast']

Solution

  1. Step 1: Understand spaCy lemmatization output

    spaCy converts words to their base forms: 'cats' to 'cat', 'are' to 'be', 'running' to 'run', and lowercases 'The' to 'the'.
  2. Step 2: Match the list of lemmas

    Only option D, ['the', 'cat', 'be', 'run', 'fast'], contains every one of these base forms in lowercase.
  3. Final Answer:

    ['the', 'cat', 'be', 'run', 'fast'] -> Option D
  4. Quick Check:

    spaCy lemma list = ['the', 'cat', 'be', 'run', 'fast'] [OK]
Hint: Lemmas are base forms, usually lowercase [OK]
Common Mistakes:
  • Expecting original words instead of lemmas
  • Not lowercasing lemmas
  • Confusing verb forms like 'are' with 'is'
4. Identify the error in this spaCy lemmatization code:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('She was eating apples')
lemmas = [token.lemma for token in doc]
print(lemmas)
medium
A. Missing parentheses in spacy.load()
B. Using token.lemma instead of token.lemma_
C. Incorrect model name in spacy.load()
D. Missing import for lemmatizer

Solution

  1. Step 1: Check spaCy lemma attribute usage

    token.lemma_ (with underscore) holds the lemma as a string. token.lemma (without underscore) holds the lemma's integer hash ID.
  2. Step 2: Identify the error in code

    The code uses token.lemma, which returns the integer hash ID rather than the lemma string, so the list prints numbers instead of words.
  3. Final Answer:

    Using token.lemma instead of token.lemma_ -> Option B
  4. Quick Check:

    Use token.lemma_ for lemma string [OK]
Hint: Remember underscore in token.lemma_ for lemma [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Assuming spacy.load() is missing parentheses
  • Thinking model name is wrong
5. You want to lemmatize a list of sentences and count how many times the lemma 'run' appears using spaCy. Which code snippet correctly does this?
hard
A.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma_ == 'run' for token in doc)
print(count)

B.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.text == 'run' for token in doc)
print(count)

C.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma == 'run' for token in doc)
print(count)

D.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma_ == 'running' for token in doc)
print(count)

Solution

  1. Step 1: Understand the goal and spaCy usage

    We want to count all tokens whose lemma is 'run', so we must use token.lemma_ and compare to 'run'.
  2. Step 2: Analyze each option

    Option A correctly uses token.lemma_ == 'run'. Option B compares the original text, so it misses 'running'. Option C uses token.lemma without the underscore, which is an integer hash ID and never equals the string 'run', so the count stays 0. Option D compares the lemma to 'running', which is not the base form.
  3. Final Answer:

    sum(token.lemma_ == 'run' for token in doc) -> Option A
  4. Quick Check:

    Count lemma 'run' using token.lemma_ == 'run' [OK]
Hint: Compare token.lemma_ to base word for counting [OK]
Common Mistakes:
  • Comparing token.text instead of token.lemma_
  • Using token.lemma without underscore
  • Comparing lemma to non-base form like 'running'
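
When more than one lemma needs counting, a Counter over token.lemma_ is one possible variation on the loop in option A. The sketch below is an assumption-laden illustration: count_lemmas is a hypothetical helper, and the Docs are built by hand on a blank pipeline with manually supplied lemmas, standing in for what a trained model would produce.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc


def count_lemmas(docs):
    """Count how often each lemma string appears across the given Docs."""
    counts = Counter()
    for doc in docs:
        counts.update(token.lemma_ for token in doc)
    return counts


# Blank pipeline and hand-written lemmas, so no model download is needed.
nlp = spacy.blank("en")
docs = [
    Doc(nlp.vocab, words=["I", "run", "daily"], lemmas=["I", "run", "daily"]),
    Doc(nlp.vocab, words=["He", "is", "running", "fast"],
        lemmas=["he", "be", "run", "fast"]),
]

counts = count_lemmas(docs)
print(counts["run"])
```

With a trained pipeline such as the one loaded by spacy.load('en_core_web_sm'), the same helper could be called as count_lemmas(nlp.pipe(sentences)), which processes the sentences as a stream instead of one call per sentence.
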