NLPml~20 mins

Lemmatization in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Lemmatization

Problem:You want to clean text data by reducing words to their base form using lemmatization. The current process uses simple tokenization without lemmatization, causing many word forms to be treated as different words.

Current Metrics:Unique tokens before lemmatization: 1200; After tokenization only, model accuracy on text classification: 75%

Issue:The model struggles because many word forms are treated separately, increasing vocabulary size and noise.

Your Task

Apply lemmatization to reduce vocabulary size and improve model accuracy to at least 80%.

Use Python and NLTK or spaCy for lemmatization.

Keep the rest of the preprocessing and model architecture unchanged.

Hint 1

Hint 2

Hint 3

Solution

NLP

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Sample text data and labels
texts = [
    'The cats are running faster',
    'A dog was running in the park',
    'He runs every morning',
    'They have run a marathon',
    'She is running late'
]
labels = [1, 0, 1, 0, 1]

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize a sentence

def lemmatize_sentence(sentence):
    words = nltk.word_tokenize(sentence)
    lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
    return ' '.join(lemmatized_words)

# Apply lemmatization
lemmatized_texts = [lemmatize_sentence(text) for text in texts]

# Vectorize texts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lemmatized_texts)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

# Vocabulary size
vocab_size = len(vectorizer.vocabulary_)

print(f'Vocabulary size after lemmatization: {vocab_size}')
print(f'Accuracy after lemmatization: {accuracy * 100:.2f}%')

Added lemmatization step using NLTK's WordNetLemmatizer with verb POS tagging.

Replaced original texts with lemmatized texts before vectorization.

Kept model and other preprocessing steps unchanged.

Results Interpretation

Before lemmatization: Vocabulary size = 20, Accuracy = 75%

After lemmatization: Vocabulary size = 19, Accuracy = 100%

Lemmatization reduces the number of unique words by grouping different forms of a word together. This helps the model learn better patterns and improves accuracy.

Bonus Experiment

Try using spaCy's lemmatizer instead of NLTK and compare the results.

💡 Hint

Use spaCy's 'en_core_web_sm' model and process texts with nlp.pipe for efficient lemmatization.

Practice

(1/5)

1. What is the main purpose of lemmatization in natural language processing?

easy

A. To find the base or dictionary form of a word

B. To count the frequency of words in a text

C. To translate text from one language to another

D. To remove stop words from a sentence

Lemmatization in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand the goal of lemmatization

Step 2: Compare with other options

Final Answer:

Quick Check:

Solution

Step 1: Identify correct POS tag for adjective

Step 2: Check other POS tags

Final Answer:

Quick Check:

Solution

Step 1: Understand default POS in lemmatize()

Step 2: Lemmatize plural noun

Final Answer:

Quick Check:

Solution

Step 1: Check default POS in lemmatize()

Step 2: Analyze 'running' as noun

Final Answer:

Quick Check:

Solution

Step 1: Understand importance of POS tags in lemmatization

Step 2: Compare approaches

Final Answer:

Quick Check: