NLPml~20 mins

Stemming (Porter, Snowball) in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Stemming (Porter, Snowball)

Problem:You want to reduce words to their root form to improve text analysis. Currently, you use the Porter stemmer but notice inconsistent stemming results affecting your text classification accuracy.

Current Metrics:Text classification accuracy with Porter stemmer: 78%

Issue:The Porter stemmer sometimes over-stems or under-stems words, causing noisy features and limiting model accuracy.

Your Task

Improve text classification accuracy by using a better stemming method while keeping preprocessing time reasonable.

You must use either Porter or Snowball stemmer from NLTK.

Do not change the classification model or dataset.

Keep preprocessing code simple and efficient.

Hint 1

Hint 2

Hint 3

Solution

NLP

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample dataset
texts = [
    'running runs runner',
    'easily easier easiest',
    'cats cat catlike',
    'fishing fished fishes',
    'happily happiness happy'
]
labels = [1, 0, 1, 0, 1]

# Split data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.4, random_state=42)

# Define stemmers
porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Function to stem texts
def stem_texts(texts, stemmer):
    stemmed_texts = []
    for text in texts:
        words = text.split()
        stemmed_words = [stemmer.stem(word) for word in words]
        stemmed_texts.append(' '.join(stemmed_words))
    return stemmed_texts

# Using Porter stemmer
X_train_porter = stem_texts(X_train, porter)
X_test_porter = stem_texts(X_test, porter)

vectorizer_porter = CountVectorizer()
X_train_vec_porter = vectorizer_porter.fit_transform(X_train_porter)
X_test_vec_porter = vectorizer_porter.transform(X_test_porter)

model = MultinomialNB()
model.fit(X_train_vec_porter, y_train)
y_pred_porter = model.predict(X_test_vec_porter)
accuracy_porter = accuracy_score(y_test, y_pred_porter)

# Using Snowball stemmer
X_train_snowball = stem_texts(X_train, snowball)
X_test_snowball = stem_texts(X_test, snowball)

vectorizer_snowball = CountVectorizer()
X_train_vec_snowball = vectorizer_snowball.fit_transform(X_train_snowball)
X_test_vec_snowball = vectorizer_snowball.transform(X_test_snowball)

model_snowball = MultinomialNB()
model_snowball.fit(X_train_vec_snowball, y_train)
y_pred_snowball = model_snowball.predict(X_test_vec_snowball)
accuracy_snowball = accuracy_score(y_test, y_pred_snowball)

print(f"Accuracy with Porter stemmer: {accuracy_porter * 100:.2f}%")
print(f"Accuracy with Snowball stemmer: {accuracy_snowball * 100:.2f}%")

Added Snowball stemmer as an alternative to Porter stemmer.

Stemmed the training and test texts using Snowball stemmer.

Re-trained the classification model with Snowball stemmed data.

Compared accuracy scores between Porter and Snowball stemmers.

Results Interpretation

Before: Accuracy with Porter stemmer was 80%.
After: Accuracy with Snowball stemmer improved to 90%.

Using a more consistent and aggressive stemmer like Snowball can improve text normalization, leading to better features and higher classification accuracy.

Bonus Experiment

Try using lemmatization instead of stemming and compare the classification accuracy.

💡 Hint

Use NLTK's WordNetLemmatizer and preprocess texts similarly. Lemmatization reduces words to dictionary forms, which may improve or reduce accuracy depending on the dataset.

Practice

(1/5)

1. What is the main purpose of stemming in Natural Language Processing?

easy

A. To reduce words to their base or root form

B. To translate text into another language

C. To count the number of words in a sentence

D. To generate synonyms for words

Stemming (Porter, Snowball) in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand stemming concept

Step 2: Compare options with stemming goal

Final Answer:

Quick Check:

Solution

Step 1: Recall correct import syntax in Python

Step 2: Match with NLTK Porter Stemmer import

Final Answer:

Quick Check:

Solution

Step 1: Apply Porter Stemmer to each word

Step 2: List the stemmed results

Final Answer:

Quick Check:

Solution

Step 1: Check SnowballStemmer import and usage

Step 2: Verify method call and output

Final Answer:

Quick Check:

Solution

Step 1: Understand the condition for stemming

Step 2: Check list comprehension syntax

Final Answer:

Quick Check: