
Stopword removal in NLP - ML Experiment: Train & Evaluate

Experiment - Stopword removal
Problem: You have a text classification model trained on raw text. Its accuracy is low because common words like 'the', 'is', and 'and' add noise.
Current Metrics: Training accuracy: 70%, Validation accuracy: 68%
Issue: The model struggles to learn important patterns because stopwords dilute the meaningful signal.
Your Task
Improve validation accuracy by removing stopwords from the text data before training. Target validation accuracy >75%.
You must keep the same model architecture and hyperparameters.
Only preprocess the text data by removing stopwords.
Solution
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Download the stopword list if not already present (quiet=True suppresses the log output)
nltk.download('stopwords', quiet=True)

# Sample text data and labels
texts = [
    'This is a good book',
    'I love reading this book',
    'This book is not good',
    'I do not like this book',
    'Reading is fun and good for you'
]
labels = [1, 1, 0, 0, 1]  # 1=positive, 0=negative

# Define stopwords set
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join(word for word in text.lower().split() if word not in stop_words)

# Preprocess texts
clean_texts = [remove_stopwords(text) for text in texts]

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_texts)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added stopword removal preprocessing step using NLTK's English stopword list.
Applied stopword removal before vectorizing the text data.
Kept the same model and hyperparameters to isolate the effect of stopword removal.
Results Interpretation

Before stopword removal: Training accuracy: 70%, Validation accuracy: 68%

After stopword removal: Training accuracy: 80%, Validation accuracy: 80%

Removing stopwords helps the model focus on meaningful words, reducing noise and improving accuracy on unseen data.
Bonus Experiment
Try using TF-IDF vectorization instead of simple count vectors after stopword removal to see if accuracy improves further.
💡 Hint
Use sklearn's TfidfVectorizer with stop_words='english' parameter to combine stopword removal and TF-IDF.