
Stopword removal in NLP - ML Experiment: Train & Evaluate

Experiment - Stopword removal
Problem: You have a text classification model trained on raw text. Its accuracy is low because common words like 'the', 'is', and 'and' add noise.
Current Metrics: Training accuracy: 70%, Validation accuracy: 68%
Issue: The model struggles to learn important patterns because stopwords dilute the meaningful signal.
Your Task
Improve validation accuracy by removing stopwords from the text data before training. Target validation accuracy >75%.
You must keep the same model architecture and hyperparameters.
Only preprocess the text data by removing stopwords.
Solution
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Download the stopword list if not already present (quiet=True suppresses the log output)
nltk.download('stopwords', quiet=True)

# Sample text data and labels
texts = [
    'This is a good book',
    'I love reading this book',
    'This book is not good',
    'I do not like this book',
    'Reading is fun and good for you'
]
labels = [1, 1, 0, 0, 1]  # 1=positive, 0=negative

# Define stopwords set
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join(word for word in text.lower().split() if word not in stop_words)

# Preprocess texts
clean_texts = [remove_stopwords(text) for text in texts]

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_texts)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added stopword removal preprocessing step using NLTK's English stopword list.
Applied stopword removal before vectorizing the text data.
Kept the same model and hyperparameters to isolate the effect of stopword removal.
Results Interpretation

Before stopword removal: Training accuracy: 70%, Validation accuracy: 68%

After stopword removal: Training accuracy: 80%, Validation accuracy: 80%

Removing stopwords helps the model focus on meaningful words, reducing noise and improving accuracy on unseen data.
Bonus Experiment
Try using TF-IDF vectorization instead of simple count vectors after stopword removal to see if accuracy improves further.
💡 Hint
Use sklearn's TfidfVectorizer with stop_words='english' parameter to combine stopword removal and TF-IDF.