
Lemmatization in spaCy in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is lemmatization in natural language processing?
Lemmatization is the process of converting a word to its base or dictionary form, called a lemma. For example, 'running' becomes 'run'. It helps in understanding the meaning of words by grouping different forms of the same word.
intermediate
How does spaCy perform lemmatization?
spaCy's trained pipelines include a lemmatizer component. For English it combines rules and lookup tables with the part-of-speech tag that earlier pipeline components assign from context, so the same spelling can receive different lemmas depending on how it is used in the sentence.
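To see the part-of-speech dependence in practice, the sketch below lists each token's coarse POS tag next to its lemma. It assumes spaCy and the en_core_web_sm model are installed and skips the demo if they are not. The helper name pos_and_lemmas and the sample sentence are illustrative, and the exact tags and lemmas depend on the model version, so no output is shown.

```python
def pos_and_lemmas(nlp, text):
    """Return (text, coarse POS tag, lemma) for every token in `text`.

    `nlp` is any loaded spaCy pipeline; the helper name is illustrative.
    """
    return [(token.text, token.pos_, token.lemma_) for token in nlp(text)]


try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    nlp = None  # spaCy or the model is not installed; skip the demo

if nlp is not None:
    # The lemmatizer should be listed among the pipeline components.
    print(nlp.pipe_names)
    # "meeting" appears once as a verb form and once as a noun.
    for row in pos_and_lemmas(nlp, "We are meeting after the meeting."):
        print(row)
```

Comparing the two rows for "meeting" is a quick way to check whether the tagger and lemmatizer are treating the verb and the noun differently.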
beginner
Which spaCy attribute gives the lemma of a token?
The attribute is token.lemma_. It returns the lemma as a string for each token in the processed text.
intermediate
Why is lemmatization better than simple stemming?
Lemmatization returns real dictionary words as base forms, considering context and part of speech, while stemming just cuts word endings and may produce non-words. Lemmatization gives more accurate and meaningful results.
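The contrast can be illustrated without any library. The sketch below is a deliberately crude toy: naive_stem is a made-up suffix stripper (not Porter or any real stemming algorithm), and TOY_LEMMAS is a tiny hand-written table standing in for a real lemma dictionary. Neither reflects how spaCy works internally.

```python
def naive_stem(word):
    """Toy suffix-stripping stemmer (illustrative only, not a real algorithm)."""
    for suffix in ("ies", "ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Tiny hand-written lookup table standing in for a real lemma dictionary.
TOY_LEMMAS = {
    "studies": "study",
    "running": "run",
    "was": "be",
    "better": "good",
}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)


for word in ("studies", "running", "was", "better"):
    print(f"{word:8} stem={naive_stem(word):8} lemma={toy_lemmatize(word)}")
```

The stemmer turns 'studies' into 'stud' and 'running' into 'runn', neither of which is a dictionary word, and it cannot relate 'was' to 'be' at all, while the lookup returns 'study', 'run', and 'be'.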
beginner
Show a simple Python code snippet using spaCy to lemmatize the sentence: 'The cats are running quickly.'
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cats are running quickly.')
lemmas = [token.lemma_ for token in doc]
print(lemmas)

This prints: ['the', 'cat', 'be', 'run', 'quickly', '.']
What does the spaCy attribute token.lemma_ return?
A. The word's frequency in the text
B. The part of speech tag
C. The original word text
D. The base form of the word
Which of these is a benefit of lemmatization over stemming?
A. Removes stop words automatically
B. Runs faster than stemming
C. Produces real dictionary words
D. Ignores word context
In spaCy, what must you do before accessing token.lemma_?
A. Load a language model and process text with nlp()
B. Manually define lemmas for each word
C. Call a separate lemmatization function
D. Nothing, it works on raw text
What is the lemma of the word 'running' in spaCy's default English model?
A. ran
B. run
C. running
D. runner
Which spaCy model is commonly used for English lemmatization?
A. en_core_web_sm
B. fr_core_news_sm
C. de_core_news_sm
D. xx_ent_wiki_sm
Explain what lemmatization is and how spaCy helps perform it.
Think about how spaCy finds the base form of words using its models.
Write a short Python code example using spaCy to lemmatize a sentence and print the lemmas.
Use nlp() to process text and a list comprehension to get lemmas.

      Practice

      1. What does lemmatization do in natural language processing using spaCy?
      easy
      A. It removes all punctuation from the text.
      B. It counts the number of words in a sentence.
      C. It finds the base or dictionary form of a word.
      D. It translates text into another language.

      Solution

      1. Step 1: Understand the purpose of lemmatization

        Lemmatization simplifies words by converting them to their base form, like 'running' to 'run'.
      2. Step 2: Compare options to definition

        Only option C describes finding the base or dictionary form of a word. The other options describe punctuation removal, word counting, and translation.
      3. Final Answer:

        It finds the base or dictionary form of a word. -> Option C
      4. Quick Check:

        Lemmatization = base form extraction [OK]
      Hint: Lemmatization = find base word form [OK]
      Common Mistakes:
      • Confusing lemmatization with token counting
      • Thinking it translates text
      • Mixing it up with punctuation removal
      2. Which of the following is the correct way to get the lemma of a token in spaCy?
      easy
      A. token.lemma_
      B. token.lemma
      C. token.lemmatize()
      D. token.get_lemma()

      Solution

      1. Step 1: Recall spaCy token attribute for lemma

        spaCy uses the attribute lemma_ (with underscore) to get the lemma as a string.
      2. Step 2: Check each option

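        token.lemma_ is the attribute that returns the lemma as a string. token.lemma (no underscore) exists but returns the lemma's integer hash ID rather than text, and token.lemmatize() and token.get_lemma() are not part of spaCy's Token API.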
      3. Final Answer:

        token.lemma_ -> Option A
      4. Quick Check:

        spaCy lemma attribute = token.lemma_ [OK]
      Hint: Use token.lemma_ with underscore for lemma string [OK]
      Common Mistakes:
      • Using token.lemma without underscore
      • Trying to call a method like lemmatize()
      • Using non-existent methods like get_lemma()
      3. Given the code snippet:
      import spacy
      nlp = spacy.load('en_core_web_sm')
      doc = nlp('The cats are running fast')
      lemmas = [token.lemma_ for token in doc]

      What is the value of lemmas?
      medium
      A. ['the', 'cats', 'are', 'running', 'fast']
      B. ['The', 'cats', 'are', 'running', 'fast']
      C. ['The', 'cat', 'is', 'run', 'fast']
      D. ['the', 'cat', 'be', 'run', 'fast']

      Solution

      1. Step 1: Understand spaCy lemmatization output

        spaCy converts words to their base forms: 'cats' to 'cat', 'are' to 'be', 'running' to 'run', and lowercases 'The' to 'the'.
      2. Step 2: Match the list of lemmas

        Option D, ['the', 'cat', 'be', 'run', 'fast'], is the only list in which every word appears as its lowercase base form.
      3. Final Answer:

        ['the', 'cat', 'be', 'run', 'fast'] -> Option D
      4. Quick Check:

        spaCy lemma list = ['the', 'cat', 'be', 'run', 'fast'] [OK]
      Hint: Lemmas are base forms, usually lowercase [OK]
      Common Mistakes:
      • Expecting original words instead of lemmas
      • Not lowercasing lemmas
      • Confusing verb forms like 'are' with 'is'
      4. Identify the error in this spaCy lemmatization code:
      import spacy
      nlp = spacy.load('en_core_web_sm')
      doc = nlp('She was eating apples')
      lemmas = [token.lemma for token in doc]
      print(lemmas)
      medium
      A. Missing parentheses in spacy.load()
      B. Using token.lemma instead of token.lemma_
      C. Incorrect model name in spacy.load()
      D. Missing import for lemmatizer

      Solution

      1. Step 1: Check spaCy lemma attribute usage

        spaCy tokens expose the lemma string as lemma_ (with a trailing underscore). The attribute without the underscore, lemma, holds the lemma's integer hash ID.
      2. Step 2: Identify the error in code

        The code uses token.lemma, so it prints a list of large integers instead of the lemma strings.
      3. Final Answer:

        Using token.lemma instead of token.lemma_ -> Option B
      4. Quick Check:

        Use token.lemma_ for lemma string [OK]
      Hint: Remember underscore in token.lemma_ for lemma [OK]
      Common Mistakes:
      • Using token.lemma without underscore
      • Assuming spacy.load() is missing parentheses
      • Thinking model name is wrong
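The difference between the two attributes can be checked directly. The sketch below assumes spaCy and en_core_web_sm are installed and skips the demo otherwise; the helper name lemma_views is illustrative. It pairs each token's integer lemma ID with its string form and resolves the ID back through the vocabulary's string store.

```python
def lemma_views(nlp, text):
    """Return (integer lemma ID, lemma string) pairs for each token."""
    return [(token.lemma, token.lemma_) for token in nlp(text)]


try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    nlp = None  # spaCy or the model is not installed; skip the demo

if nlp is not None:
    for lemma_id, lemma_text in lemma_views(nlp, "She was eating apples"):
        # The string store maps the integer ID back to the lemma string.
        assert nlp.vocab.strings[lemma_id] == lemma_text
        print(lemma_id, lemma_text)
```

No exception is raised when token.lemma is used by mistake, which is why the bug tends to show up only as unexpected numbers in the output.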
      5. You want to lemmatize a list of sentences and count how many times the lemma 'run' appears using spaCy. Which code snippet correctly does this?
      hard
      A.
      import spacy
      nlp = spacy.load('en_core_web_sm')
      sentences = ['I run daily', 'He is running fast']
      count = 0
      for sent in sentences:
          doc = nlp(sent)
          count += sum(token.lemma_ == 'run' for token in doc)
      print(count)

      B.
      import spacy
      nlp = spacy.load('en_core_web_sm')
      sentences = ['I run daily', 'He is running fast']
      count = 0
      for sent in sentences:
          doc = nlp(sent)
          count += sum(token.text == 'run' for token in doc)
      print(count)

      C.
      import spacy
      nlp = spacy.load('en_core_web_sm')
      sentences = ['I run daily', 'He is running fast']
      count = 0
      for sent in sentences:
          doc = nlp(sent)
          count += sum(token.lemma == 'run' for token in doc)
      print(count)

      D.
      import spacy
      nlp = spacy.load('en_core_web_sm')
      sentences = ['I run daily', 'He is running fast']
      count = 0
      for sent in sentences:
          doc = nlp(sent)
          count += sum(token.lemma_ == 'running' for token in doc)
      print(count)

      Solution

      1. Step 1: Understand the goal and spaCy usage

        We want to count all tokens whose lemma is 'run', so we must use token.lemma_ and compare to 'run'.
      2. Step 2: Analyze each option

        Option A correctly compares token.lemma_ to 'run'. Option B compares the original text, so it misses 'running'. Option C uses token.lemma without the underscore, which is an integer ID and never equals the string 'run'. Option D compares the lemma to 'running', which is not the base form.
      3. Final Answer:

        sum(token.lemma_ == 'run' for token in doc) -> Option A
      4. Quick Check:

        Count lemma 'run' using token.lemma_ == 'run' [OK]
      Hint: Compare token.lemma_ to base word for counting [OK]
      Common Mistakes:
      • Comparing token.text instead of token.lemma_
      • Using token.lemma without underscore
      • Comparing lemma to non-base form like 'running'
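When more than one lemma needs counting, the same idea extends to a frequency table. The sketch below is one possible approach, not the only one: count_lemmas is an illustrative helper that feeds the sentences through nlp.pipe and tallies every lemma with collections.Counter, skipping punctuation and whitespace tokens. It assumes spaCy and en_core_web_sm are installed and skips the demo otherwise.

```python
from collections import Counter


def count_lemmas(nlp, texts):
    """Count lemma occurrences across `texts`, skipping punctuation and spaces."""
    counts = Counter()
    # nlp.pipe processes the texts as a stream instead of one call per sentence.
    for doc in nlp.pipe(texts):
        counts.update(
            token.lemma_
            for token in doc
            if not (token.is_punct or token.is_space)
        )
    return counts


try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    nlp = None  # spaCy or the model is not installed; skip the demo

if nlp is not None:
    counts = count_lemmas(nlp, ["I run daily", "He is running fast"])
    print(counts["run"])
    print(counts.most_common(3))
```

A Counter returns 0 for lemmas that never occur, so looking up a missing lemma does not raise an error.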