
Lemmatization in spaCy in NLP - ML Experiment: Train & Evaluate

Experiment - Lemmatization in spaCy
Problem: You want to convert the words in a sentence to their base forms (lemmas) using spaCy. Your current code extracts lemmas but also returns punctuation and stop words, which makes the output noisy.
Current Metrics: Lemma extraction is 85% accurate (checked manually on sample sentences), but the output includes unwanted tokens such as punctuation and stop words.
Issue: The lemmas themselves are correct, but punctuation and stop words are not filtered out, which lowers the quality of the lemmatized output.
Your Task
Improve the lemmatization output by filtering out punctuation and stop words, so the final list contains only meaningful lemmas.
Use spaCy's built-in features only (no external libraries).
Keep the lemmatization process efficient and simple.
Solution
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Sample text
text = "The striped bats are hanging on their feet for best"

# Process the text
doc = nlp(text)

# Extract lemmas filtering out punctuation and stop words
lemmas = [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]

print(lemmas)
Added filtering to remove tokens that are punctuation using token.is_punct.
Added filtering to remove stop words using token.is_stop.
Extracted lemmas only from filtered tokens to get meaningful base forms.
Results Interpretation

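Before: The lemma list still contained stop words (and would contain punctuation if the text had any), e.g., ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'good']

After: With punctuation and stop words filtered out, only content lemmas remain, e.g., ['striped', 'bat', 'hang', 'foot', 'good']. Exact lemmas can vary with the spaCy and model version.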

Filtering tokens using spaCy's attributes like is_punct and is_stop helps clean lemmatization output, making it more useful for downstream tasks.
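To see what the filter actually removes, it can help to record each dropped token together with the reason it was dropped. The following is a minimal sketch, not part of the original solution: `partition_tokens` and `main` are hypothetical names, and the function relies only on the token attributes `text`, `lemma_`, `is_punct`, and `is_stop`.

```python
def partition_tokens(doc):
    """Split a spaCy Doc into kept lemmas and dropped (text, reason) pairs."""
    kept, dropped = [], []
    for token in doc:
        if token.is_punct:
            dropped.append((token.text, "punctuation"))
        elif token.is_stop:
            dropped.append((token.text, "stop word"))
        else:
            kept.append(token.lemma_)
    return kept, dropped


def main():
    # Assumes spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The striped bats are hanging on their feet for best")
    kept, dropped = partition_tokens(doc)
    print("kept:", kept)
    print("dropped:", dropped)
```

Calling `main()` prints both lists, which makes it easy to check whether a word you care about is being discarded as a stop word.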
Bonus Experiment
Try lemmatizing a longer paragraph and count the frequency of each lemma after filtering.
💡 Hint
Use a Python dictionary or collections.Counter to count lemmas after filtering punctuation and stop words.
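One possible way to approach the bonus experiment with collections.Counter is sketched below. The names `lemma_frequencies` and `main` and the sample paragraph are made up for illustration; the filtering condition is the same one used in the solution.

```python
from collections import Counter


def lemma_frequencies(doc):
    """Count lemmas in a spaCy Doc, skipping punctuation and stop words."""
    return Counter(
        token.lemma_
        for token in doc
        if not token.is_punct and not token.is_stop
    )


def main():
    # Assumes spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Illustrative paragraph; substitute any longer text.
    paragraph = (
        "The bats were hanging from the branches. "
        "One bat hung lower than the others. "
        "At night the bats fly out to feed."
    )
    for lemma, count in lemma_frequencies(nlp(paragraph)).most_common(5):
        print(lemma, count)
```

If the paragraph contains line breaks, spaCy may emit whitespace tokens that are neither punctuation nor stop words; adding `not token.is_space` to the condition would drop them.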

Practice

1. What does lemmatization do in natural language processing using spaCy?
easy
A. It removes all punctuation from the text.
B. It counts the number of words in a sentence.
C. It finds the base or dictionary form of a word.
D. It translates text into another language.

Solution

  1. Step 1: Understand the purpose of lemmatization

    Lemmatization simplifies words by converting them to their base form, like 'running' to 'run'.
  2. Step 2: Compare options to definition

    Only option C describes lemmatization. The other options describe punctuation removal, word counting, and translation.
  3. Final Answer:

    It finds the base or dictionary form of a word. -> Option C
  4. Quick Check:

    Lemmatization = base form extraction [OK]
Hint: Lemmatization = find base word form [OK]
Common Mistakes:
  • Confusing lemmatization with token counting
  • Thinking it translates text
  • Mixing it up with punctuation removal
2. Which of the following is the correct way to get the lemma of a token in spaCy?
easy
A. token.lemma_
B. token.lemma
C. token.lemmatize()
D. token.get_lemma()

Solution

  1. Step 1: Recall spaCy token attribute for lemma

    spaCy uses the attribute lemma_ (with underscore) to get the lemma as a string.
  2. Step 2: Check each option

    token.lemma_ returns the lemma as a string. token.lemma (no underscore) exists but returns the lemma's integer hash ID rather than text, and token.lemmatize() and token.get_lemma() are not part of spaCy's Token API.
  3. Final Answer:

    token.lemma_ -> Option A
  4. Quick Check:

    spaCy lemma attribute = token.lemma_ [OK]
Hint: Use token.lemma_ with underscore for lemma string [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Trying to call a method like lemmatize()
  • Using non-existent methods like get_lemma()
3. Given the code snippet:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cats are running fast')
lemmas = [token.lemma_ for token in doc]

What is the value of lemmas?
medium
A. ['the', 'cats', 'are', 'running', 'fast']
B. ['The', 'cats', 'are', 'running', 'fast']
C. ['The', 'cat', 'is', 'run', 'fast']
D. ['the', 'cat', 'be', 'run', 'fast']

Solution

  1. Step 1: Understand spaCy lemmatization output

    spaCy converts words to their base forms: 'cats' to 'cat', 'are' to 'be', 'running' to 'run', and lowercases 'The' to 'the'.
  2. Step 2: Match the list of lemmas

    Option D, ['the', 'cat', 'be', 'run', 'fast'], is the only list in which every word is reduced to its lowercase base form.
  3. Final Answer:

    ['the', 'cat', 'be', 'run', 'fast'] -> Option D
  4. Quick Check:

    spaCy lemma list = ['the', 'cat', 'be', 'run', 'fast'] [OK]
Hint: Lemmas are base forms, usually lowercase [OK]
Common Mistakes:
  • Expecting original words instead of lemmas
  • Expecting lemmas to keep the original capitalization
  • Expecting 'are' to become 'is' instead of the base form 'be'
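A quick way to check answers like this one is to print each token's text next to its lemma. The sketch below uses hypothetical helper names (`text_lemma_pairs`, `changed_tokens`, `main`) and only the token attributes `text` and `lemma_`.

```python
def text_lemma_pairs(doc):
    """Pair each token's original text with its lemma string."""
    return [(token.text, token.lemma_) for token in doc]


def changed_tokens(doc):
    """Keep only the pairs where lemmatization changed the surface form."""
    return [(text, lemma) for text, lemma in text_lemma_pairs(doc) if text != lemma]


def main():
    # Assumes spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The cats are running fast")
    for text, lemma in text_lemma_pairs(doc):
        print(f"{text!r} -> {lemma!r}")
```

`changed_tokens` is handy on longer texts, where most tokens are already in their base form and only the differences are interesting.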
4. Identify the error in this spaCy lemmatization code:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('She was eating apples')
lemmas = [token.lemma for token in doc]
print(lemmas)
medium
A. Missing parentheses in spacy.load()
B. Using token.lemma instead of token.lemma_
C. Incorrect model name in spacy.load()
D. Missing import for lemmatizer

Solution

  1. Step 1: Check spaCy lemma attribute usage

    spaCy tokens have lemma_ (with underscore) for lemma string, not lemma.
  2. Step 2: Identify the error in code

    The code uses token.lemma, which returns the lemma's integer hash ID rather than the lemma string, so the printed list contains large integers instead of words.
  3. Final Answer:

    Using token.lemma instead of token.lemma_ -> Option B
  4. Quick Check:

    Use token.lemma_ for lemma string [OK]
Hint: Remember underscore in token.lemma_ for lemma [OK]
Common Mistakes:
  • Using token.lemma without underscore
  • Assuming spacy.load() is missing parentheses
  • Thinking the model name is wrong
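The difference between the two attributes can be made visible by looking at both for the same token. This is a small sketch with hypothetical names (`lemma_views`, `main`); it assumes the usual spaCy behaviour that the integer ID can be mapped back to text through the vocabulary's string store.

```python
def lemma_views(token):
    """Return (integer ID, string, string looked up from the ID) for a token's lemma."""
    lemma_id = token.lemma  # integer hash ID, not text
    lemma_text = token.lemma_  # the lemma as a string
    resolved = token.vocab.strings[lemma_id]  # map the ID back to its string
    return lemma_id, lemma_text, resolved


def main():
    # Assumes spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    for token in nlp("She was eating apples"):
        lemma_id, lemma_text, resolved = lemma_views(token)
        print(token.text, lemma_id, lemma_text, resolved)
```

In the printed rows, the second column is a large integer while the last two columns show the same word, which is why the quiz code needs `token.lemma_` to produce readable output.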
5. You want to lemmatize a list of sentences and count how many times the lemma 'run' appears using spaCy. Which code snippet correctly does this?
hard
A.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma_ == 'run' for token in doc)
print(count)
B.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.text == 'run' for token in doc)
print(count)
C.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma == 'run' for token in doc)
print(count)
D.
import spacy
nlp = spacy.load('en_core_web_sm')
sentences = ['I run daily', 'He is running fast']
count = 0
for sent in sentences:
    doc = nlp(sent)
    count += sum(token.lemma_ == 'running' for token in doc)
print(count)

Solution

  1. Step 1: Understand the goal and spaCy usage

    We want to count all tokens whose lemma is 'run', so we must use token.lemma_ and compare to 'run'.
  2. Step 2: Analyze each option

    Option A correctly compares token.lemma_ to 'run'. Option B compares token.text, so it counts 'run' but misses 'running'. Option C compares token.lemma, an integer hash ID, to a string; the two are never equal, so the count stays at 0. Option D compares the lemma to 'running', which is not the base form.
  3. Final Answer:

    sum(token.lemma_ == 'run' for token in doc) -> Option A
  4. Quick Check:

    Count lemma 'run' using token.lemma_ == 'run' [OK]
Hint: Compare token.lemma_ to base word for counting [OK]
Common Mistakes:
  • Comparing token.text instead of token.lemma_
  • Using token.lemma without underscore
  • Comparing lemma to non-base form like 'running'
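The counting logic from the correct option can also be wrapped in a reusable function. The sketch below is one possible variant, not the quiz's own code: `count_lemma` and `main` are hypothetical names, and it uses `nlp.pipe` to process the sentences as a stream rather than calling `nlp` once per sentence.

```python
def count_lemma(docs, target):
    """Count tokens across an iterable of Docs whose lemma equals target."""
    return sum(token.lemma_ == target for doc in docs for token in doc)


def main():
    # Assumes spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    sentences = ["I run daily", "He is running fast"]
    # nlp.pipe yields one Doc per input text.
    print(count_lemma(nlp.pipe(sentences), "run"))
```

Because `count_lemma` accepts any iterable of Docs, the same function works for a single document (`count_lemma([doc], "run")`) or a large batch.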