Lemmatization in spaCy in NLP - ML Experiment: Train & Evaluate

Experiment - Lemmatization in spaCy
Problem: You want to convert the words in a sentence to their base forms (lemmas) using spaCy. Your current code extracts lemmas but also keeps punctuation and stop words, which makes the output noisy.
Current Metrics: Accuracy of lemma extraction: 85% (manually checked on sample sentences). Output includes unwanted tokens such as punctuation and stop words.
Issue: The code extracts lemmas correctly but does not filter out punctuation and stop words, which reduces the quality of the lemmatized output.
Your Task
Improve the lemmatization output by filtering out punctuation and stop words, so the final list contains only meaningful lemmas.
Use spaCy's built-in features only (no external libraries).
Keep the lemmatization process efficient and simple.
Solution
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Sample text
text = "The striped bats are hanging on their feet for best"

# Process the text
doc = nlp(text)

# Extract lemmas filtering out punctuation and stop words
lemmas = [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]

print(lemmas)
Added filtering to remove tokens that are punctuation using token.is_punct.
Added filtering to remove stop words using token.is_stop.
Extracted lemmas only from filtered tokens to get meaningful base forms.
Results Interpretation

Before: Lemmas included stop words (and would include punctuation whenever the text contains any), e.g., ['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'good']

After: Lemmas filtered to remove punctuation and stop words, e.g., ['striped', 'bat', 'hang', 'foot', 'good']

Filtering tokens using spaCy's attributes like is_punct and is_stop helps clean lemmatization output, making it more useful for downstream tasks.
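
To see why each token was kept or dropped, it can help to label every token before filtering. The following is a minimal sketch, not part of the original solution: the helper names explain_filter and demo are hypothetical, and the demo sentence adds a final period (an assumption) so that the punctuation filter has something to act on. It relies only on the spaCy token attributes text, lemma_, is_punct and is_stop.

```python
def explain_filter(doc):
    """Return one (text, lemma, reason) tuple per token.

    reason is 'punctuation', 'stop word' or 'kept', mirroring the
    filter used in the solution (is_punct first, then is_stop).
    """
    rows = []
    for token in doc:
        if token.is_punct:
            reason = "punctuation"
        elif token.is_stop:
            reason = "stop word"
        else:
            reason = "kept"
        rows.append((token.text, token.lemma_, reason))
    return rows


def demo():
    # Deferred import: explain_filter itself only needs token-like objects.
    import spacy

    # Assumes the en_core_web_sm model is installed, as in the solution.
    nlp = spacy.load("en_core_web_sm")
    # Hypothetical variant of the sample sentence, with a trailing period.
    doc = nlp("The striped bats are hanging on their feet for best.")
    for text, lemma, reason in explain_filter(doc):
        print(f"{text:<10} {lemma:<10} {reason}")
```

Calling demo() prints one line per token. The exact lemmas can differ between model versions, so treat the lists above as illustrative rather than guaranteed.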
Bonus Experiment
Try lemmatizing a longer paragraph and count the frequency of each lemma after filtering.
💡 Hint
Use a Python dictionary or collections.Counter to count lemmas after filtering punctuation and stop words.
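
One possible way to approach the bonus experiment is sketched below, using collections.Counter as the hint suggests. The function names count_lemmas and demo and the sample paragraph are hypothetical. Two choices go beyond the original solution and are assumptions: whitespace tokens are skipped with token.is_space (longer paragraphs often contain line breaks), and lemmas are lowercased so that "Bat" and "bat" are counted together.

```python
from collections import Counter


def count_lemmas(doc):
    """Count lemmas of tokens that are not punctuation, stop words or whitespace."""
    return Counter(
        token.lemma_.lower()  # assumption: fold case before counting
        for token in doc
        if not token.is_punct and not token.is_stop and not token.is_space
    )


def demo():
    # Deferred import: count_lemmas itself only needs token-like objects.
    import spacy

    # Assumes the en_core_web_sm model is installed, as in the solution.
    nlp = spacy.load("en_core_web_sm")
    # Hypothetical sample paragraph; replace it with your own text.
    paragraph = (
        "The striped bats are hanging on their feet. "
        "Bats hang upside down, and a hanging bat sleeps during the day."
    )
    counts = count_lemmas(nlp(paragraph))
    # most_common(n) returns the n most frequent lemmas with their counts.
    for lemma, freq in counts.most_common(5):
        print(lemma, freq)
```

Calling demo() prints the five most frequent lemmas. Counter is a dict subclass, so counts["bat"] works as a normal lookup and returns 0 for lemmas that never occurred.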