Lemmatization helps find the base form of words. It makes text easier to analyze by treating different forms of a word as one.
Lemmatization in spaCy (NLP)
Introduction
Lemmatization is useful in situations like these:
When you want to count how often a word appears, ignoring its different forms.
When you need to compare words in their simplest form for search or matching.
When cleaning text data before training a language model.
When analyzing text to find the main meaning without extra word endings.
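The counting use case above can be sketched without a full spaCy pipeline. The `LEMMA_TABLE` below is a toy stand-in for what spaCy's lemmatizer computes; it is not part of spaCy itself:

```python
from collections import Counter

# Toy lemma lookup standing in for spaCy's lemmatizer (illustrative only)
LEMMA_TABLE = {'runs': 'run', 'running': 'run', 'ran': 'run', 'cats': 'cat'}

def lemma(word):
    # Fall back to the lowercased word when no base form is known
    return LEMMA_TABLE.get(word.lower(), word.lower())

tokens = 'The cat runs while other cats ran and keep running'.split()
counts = Counter(lemma(t) for t in tokens)
print(counts['run'])  # 'runs', 'ran', and 'running' all count as 'run'
```

Because 'runs', 'ran', and 'running' all map to 'run', the counter reports one word with three occurrences instead of three separate words.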
Syntax
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('running runs ran')
lemmas = [token.lemma_ for token in doc]
Use token.lemma_ to get the base form (lemma) of each word.
Make sure to load a spaCy language model such as en_core_web_sm before lemmatizing; if it is not installed yet, install it with python -m spacy download en_core_web_sm.
Examples
This example shows lemmatization of plural and verb forms.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('cats are running')
lemmas = [token.lemma_ for token in doc]
print(lemmas)
Lemmatization also handles irregular forms like comparative and superlative adjectives.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('better best good')
lemmas = [token.lemma_ for token in doc]
print(lemmas)
Sample Program
This program loads spaCy's English model, processes a sentence, and prints the base forms of each word.
import spacy

# Load English model
nlp = spacy.load('en_core_web_sm')

# Text with different word forms
text = 'The children are playing and played in the playground.'
doc = nlp(text)

# Extract lemmas
lemmas = [token.lemma_ for token in doc]
print('Original text:', text)
print('Lemmatized tokens:', lemmas)
Important Notes
Lemmatization depends on the word's context, so spaCy uses part-of-speech tags to get accurate lemmas.
Stop words like 'the' keep their lemma unchanged because they are already in base form.
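The part-of-speech note above is the core idea behind spaCy's rule-based English lemmatizer: exception tables for irregular forms plus suffix rules keyed by POS tag. A minimal sketch of that idea follows; the tables here are made up for illustration and are far smaller than spaCy's real ones:

```python
# Minimal sketch of POS-aware lemmatization (illustrative tables, not spaCy's)
RULES = {
    'VERB': [('ing', ''), ('ed', ''), ('s', '')],
    'NOUN': [('s', '')],
}
EXCEPTIONS = {('VERB', 'ran'): 'run', ('ADJ', 'better'): 'good'}

def lemmatize(word, pos):
    # Irregular forms are resolved by lookup before any suffix rule applies
    if (pos, word) in EXCEPTIONS:
        return EXCEPTIONS[(pos, word)]
    # Otherwise try the suffix rules for this part of speech
    for suffix, repl in RULES.get(pos, []):
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    return word

print(lemmatize('meeting', 'VERB'))  # 'meet': the verb reading strips -ing
print(lemmatize('meeting', 'NOUN'))  # 'meeting': the noun reading keeps it
```

Real spaCy lemmatization also uses lookup tables and the token's morphology, but the point here is simply that the same surface form can yield different lemmas under different POS tags, which is why spaCy tags before it lemmatizes.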
Summary
Lemmatization finds the base form of words to simplify text analysis.
Use token.lemma_ in spaCy after loading a language model.
It helps treat different word forms as the same word for better understanding.