Lemmatization in NLP - ML Experiment: Train & Evaluate

Experiment - Lemmatization
Problem: You want to clean text data by reducing words to their base form using lemmatization. The current process uses simple tokenization without lemmatization, so different forms of the same word are treated as different words.
Current Metrics: Unique tokens before lemmatization: 1200. Model accuracy on text classification with tokenization only: 75%.
Issue: The model struggles because inflected forms of the same word are treated separately, which increases vocabulary size and noise.
Your Task
Apply lemmatization to reduce vocabulary size and improve model accuracy to at least 80%.
Use Python and NLTK or spaCy for lemmatization.
Keep the rest of the preprocessing and model architecture unchanged.
Solution
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize in newer NLTK releases
nltk.download('wordnet')
nltk.download('omw-1.4')

# Sample text data and labels
texts = [
    'The cats are running faster',
    'A dog was running in the park',
    'He runs every morning',
    'They have run a marathon',
    'She is running late'
]
labels = [1, 0, 1, 0, 1]

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize a sentence.
# pos='v' treats every token as a verb, so 'running' and 'runs' both become 'run'.
def lemmatize_sentence(sentence):
    words = nltk.word_tokenize(sentence)
    lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
    return ' '.join(lemmatized_words)

# Apply lemmatization
lemmatized_texts = [lemmatize_sentence(text) for text in texts]

# Vectorize texts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lemmatized_texts)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

# Vocabulary size
vocab_size = len(vectorizer.vocabulary_)

print(f'Vocabulary size after lemmatization: {vocab_size}')
print(f'Accuracy after lemmatization: {accuracy * 100:.2f}%')
Added a lemmatization step using NLTK's WordNetLemmatizer, treating every token as a verb (pos='v') rather than running a POS tagger.
Replaced original texts with lemmatized texts before vectorization.
Kept model and other preprocessing steps unchanged.
Results Interpretation

Before lemmatization: Vocabulary size = 20, Accuracy = 75%

After lemmatization: Vocabulary size = 19, Accuracy = 100%

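Lemmatization reduces the number of unique words by grouping the inflected forms of a word under a single lemma. A smaller, less noisy vocabulary can help the model learn more general patterns. Because the sample above has only five sentences, its test split holds just two examples, so the printed accuracy can only be 0%, 50%, or 100% and is not a reliable estimate; confirm the gain on the full dataset.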
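To make the vocabulary effect concrete, here is a minimal sketch in plain Python. It does not call NLTK: `TOY_LEMMAS` is a hypothetical, hand-written lookup table standing in for a real lemmatizer, and `vocabulary` is an illustrative helper, not part of the solution above.

```python
# Hypothetical lookup table standing in for a real lemmatizer (illustration only).
TOY_LEMMAS = {"runs": "run", "running": "run", "ran": "run", "cats": "cat"}


def vocabulary(tokens, lemma_table=None):
    """Return the set of unique tokens, optionally mapped through lemma_table."""
    table = lemma_table or {}
    # Tokens missing from the table are kept unchanged.
    return {table.get(token, token) for token in tokens}


tokens = ["run", "runs", "running", "ran", "cat", "cats"]
before = vocabulary(tokens)              # every surface form counts separately
after = vocabulary(tokens, TOY_LEMMAS)   # inflected forms collapse to one lemma

print(len(before), len(after))  # 6 2
```

Six surface forms collapse to two entries, which is the same grouping a real lemmatizer performs at scale.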
Bonus Experiment
Try using spaCy's lemmatizer instead of NLTK and compare the results.
💡 Hint
Use spaCy's 'en_core_web_sm' model and process texts with nlp.pipe for efficient lemmatization.
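One possible shape for the bonus experiment is sketched below. It assumes spaCy is installed and that the model has been fetched with `python -m spacy download en_core_web_sm`. The helper names `load_pipeline` and `lemmatize_texts` are illustrative, not part of the original solution.

```python
def load_pipeline():
    """Load spaCy's small English model (assumes it has been downloaded)."""
    import spacy  # imported here so the helper below has no hard dependency

    return spacy.load("en_core_web_sm")


def lemmatize_texts(nlp, texts):
    """Lemmatize texts in one batch with nlp.pipe.

    Returns one space-joined string of lemmas per input text, which can be
    passed to the same CountVectorizer used in the NLTK solution.
    """
    return [
        " ".join(token.lemma_ for token in doc)
        for doc in nlp.pipe(texts)
    ]
```

A hedged usage example: `lemmatized_texts = lemmatize_texts(load_pipeline(), texts)`, followed by the unchanged vectorize, split, and train steps. Unlike the NLTK version with a fixed pos='v', spaCy assigns a part of speech to each token before lemmatizing, so the resulting vocabulary may differ.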

Practice

1. What is the main purpose of lemmatization in natural language processing?
easy
A. To find the base or dictionary form of a word
B. To count the frequency of words in a text
C. To translate text from one language to another
D. To remove stop words from a sentence

Solution

  1. Step 1: Understand the goal of lemmatization

    Lemmatization simplifies words by converting them to their base or dictionary form, like 'running' to 'run'.
  2. Step 2: Compare with other options

    Counting words, translating, or removing stop words are different NLP tasks unrelated to lemmatization.
  3. Final Answer:

    To find the base or dictionary form of a word -> Option A
  4. Quick Check:

    Lemmatization = base form extraction [OK]
Hint: Lemmatization = find the dictionary (base) form of a word [OK]
Common Mistakes:
  • Confusing lemmatization with stemming
  • Thinking it counts words
  • Mixing it with translation tasks
2. Which of the following is the correct way to use the WordNetLemmatizer from NLTK to lemmatize the word 'better' as an adjective?
easy
A. lemmatizer.lemmatize('better', pos='a')
B. lemmatizer.lemmatize('better', pos='v')
C. lemmatizer.lemmatize('better')
D. lemmatizer.lemmatize('better', pos='n')

Solution

  1. Step 1: Identify correct POS tag for adjective

    In NLTK, 'a' is the POS tag for adjectives, so to lemmatize 'better' as an adjective, use pos='a'. This returns 'good'.
  2. Step 2: Check other POS tags

    'v' means verb and 'n' means noun, and omitting pos defaults to noun. None of these treats 'better' as an adjective.
  3. Final Answer:

    lemmatizer.lemmatize('better', pos='a') -> Option A
  4. Quick Check:

    POS 'a' = adjective lemmatization [OK]
Hint: Use pos='a' for adjectives in lemmatizer [OK]
Common Mistakes:
  • Omitting the POS tag, which then defaults to noun
  • Using the wrong POS tag, such as 'v', for an adjective
  • Passing a part-of-speech name (such as 'adjective') instead of the tag 'a'
3. What will be the output of the following Python code using NLTK's WordNetLemmatizer?
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('wolves'))
medium
A. 'wolves'
B. Error: missing POS argument
C. 'wolve'
D. 'wolf'

Solution

  1. Step 1: Understand default POS in lemmatize()

    By default, lemmatize() assumes pos='n' (noun). 'wolves' is a plural noun.
  2. Step 2: Lemmatize plural noun

    The lemmatizer converts plural nouns to singular, so 'wolves' becomes 'wolf'.
  3. Final Answer:

    'wolf' -> Option D
  4. Quick Check:

    Plural noun 'wolves' -> singular 'wolf' [OK]
Hint: Default pos='n' converts plural nouns to singular [OK]
Common Mistakes:
  • Expecting output to be unchanged plural
  • Thinking POS argument is mandatory
  • Confusing lemmatization with stemming
4. Consider this code snippet:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word = 'running'
print(lemmatizer.lemmatize(word))

Why does the output remain 'running' instead of 'run'?
medium
A. Because the lemmatizer cannot process verbs
B. Because the default POS is noun, and 'running' as noun stays unchanged
C. Because the word is misspelled
D. Because lemmatization always returns the original word

Solution

  1. Step 1: Check default POS in lemmatize()

    Without specifying POS, lemmatize() treats words as nouns by default.
  2. Step 2: Analyze 'running' as noun

    'running' is itself a valid noun (as in 'the running of a race'), so when treated as a noun it is returned unchanged.
  3. Final Answer:

    Because the default POS is noun, and 'running' as noun stays unchanged -> Option B
  4. Quick Check:

    Default POS noun keeps 'running' unchanged [OK]
Hint: Specify pos='v' to lemmatize verbs correctly [OK]
Common Mistakes:
  • Assuming lemmatizer always changes words
  • Not specifying POS for verbs
  • Thinking 'running' is misspelled
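The effect of the pos argument can be checked side by side. The sketch below assumes NLTK and its 'wordnet' resource are installed. `lemmas_by_pos` and `demo` are illustrative helpers, not NLTK functions; only `WordNetLemmatizer.lemmatize(word, pos=...)` comes from the library.

```python
# WordNet's single-letter POS codes, keyed by a readable name.
POS_CODES = {"noun": "n", "verb": "v", "adjective": "a", "adverb": "r"}


def lemmas_by_pos(word, lemmatizer):
    """Lemmatize `word` once per POS code and return {pos_name: lemma}."""
    return {
        name: lemmatizer.lemmatize(word, pos=code)
        for name, code in POS_CODES.items()
    }


def demo():
    """Print the lemma of 'running' under each POS (needs NLTK + 'wordnet')."""
    from nltk.stem import WordNetLemmatizer

    results = lemmas_by_pos("running", WordNetLemmatizer())
    for name, lemma in results.items():
        print(f"{name:>9}: {lemma}")
```

Calling `demo()` should show 'running' for the noun reading and 'run' for the verb reading, matching the explanation above.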
5. You want to lemmatize the sentence 'The striped bats are hanging on their feet.' correctly using NLTK. Which approach will give the best lemmatization results?
hard
A. Lemmatize each word without POS tags
B. Remove stop words before lemmatization
C. Lemmatize each word with POS tags obtained from POS tagging
D. Use stemming instead of lemmatization

Solution

  1. Step 1: Understand importance of POS tags in lemmatization

    Lemmatization accuracy improves when each word's part of speech is known and used.
  2. Step 2: Compare approaches

    Lemmatizing without POS tags may give wrong base forms; stemming changes words roughly; removing stop words doesn't improve lemmatization.
  3. Final Answer:

    Lemmatize each word with POS tags obtained from POS tagging -> Option C
  4. Quick Check:

    POS tagging + lemmatization = best accuracy [OK]
Hint: Use POS tags for accurate lemmatization [OK]
Common Mistakes:
  • Skipping POS tagging before lemmatization
  • Confusing stemming with lemmatization
  • Thinking stop word removal affects lemmatization
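A minimal sketch of option C follows: tag each token, translate the tag, then lemmatize. It assumes NLTK with its tokenizer, POS tagger, and 'wordnet' resources installed (resource names vary by NLTK version). `penn_to_wordnet` and `lemmatize_with_pos` are illustrative names, and the fallback to noun is a design assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag) to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("R"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun


def lemmatize_with_pos(sentence):
    """Tokenize, POS-tag, and lemmatize a sentence; returns a list of lemmas."""
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [
        lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
        for token, tag in tagged
    ]
```

For example, `lemmatize_with_pos('The striped bats are hanging on their feet.')` lemmatizes each word according to its tagged part of speech; the exact output depends on the tags the tagger assigns.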