
Stopword removal in NLP - ML Experiment: Train & Evaluate

Experiment - Stopword removal
Problem: You have a text classification model that uses raw text data. The model's accuracy is low because common words like 'the', 'is', and 'and' add noise.
Current Metrics: Training accuracy: 70%, Validation accuracy: 68%
Issue: The model struggles to learn important patterns because stopwords dilute meaningful information.
Your Task
Improve validation accuracy by removing stopwords from the text data before training. Target validation accuracy >75%.
You must keep the same model architecture and hyperparameters.
Only preprocess the text data by removing stopwords.
Solution
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Sample text data and labels
texts = [
    'This is a good book',
    'I love reading this book',
    'This book is not good',
    'I do not like this book',
    'Reading is fun and good for you'
]
labels = [1, 1, 0, 0, 1]  # 1=positive, 0=negative

# Define stopwords set
stop_words = set(stopwords.words('english'))

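# Function to remove stopwords.
# Note: NLTK's English list includes negations such as 'not', so
# 'This book is not good' is reduced to 'book good'.
def remove_stopwords(text):
    return ' '.join(word for word in text.lower().split() if word not in stop_words)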

# Preprocess texts
clean_texts = [remove_stopwords(text) for text in texts]

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_texts)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added stopword removal preprocessing step using NLTK's English stopword list.
Applied stopword removal before vectorizing the text data.
Kept the same model and hyperparameters to isolate the effect of stopword removal.
Results Interpretation

Before stopword removal: Training accuracy: 70%, Validation accuracy: 68%

After stopword removal: Training accuracy: 80%, Validation accuracy: 80%

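These are the exercise's reference figures. The five-sentence sample in the solution code is too small to reproduce them: its split leaves three training and two validation examples, so training accuracy can only be 0%, 33.33%, 66.67%, or 100%, and validation accuracy only 0%, 50%, or 100%. Removing stopwords can help the model focus on content-bearing words and reduce noise, but the size of the gain depends on the dataset and the task.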
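A before/after comparison is easier to trust when both runs share the same split and model. The sketch below is one way to set that up. The `evaluate` helper and `DEMO_STOP_WORDS` are hypothetical names introduced for illustration, and the tiny hand-written list stands in for NLTK's English list so the code runs without a corpus download. Unlike the solution above, it fits the vectorizer on the training split only, which is a design choice of this sketch rather than part of the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for NLTK's English stopword list (assumption:
# kept tiny so the sketch needs no corpus download).
DEMO_STOP_WORDS = ["this", "is", "and", "for", "you", "do"]

texts = [
    "This is a good book",
    "I love reading this book",
    "This book is not good",
    "I do not like this book",
    "Reading is fun and good for you",
]
labels = [1, 1, 0, 0, 1]  # 1=positive, 0=negative


def evaluate(stop_words):
    """Train the same model on the same split, with or without filtering."""
    train_texts, val_texts, y_train, y_val = train_test_split(
        texts, labels, test_size=0.4, random_state=42
    )
    # stop_words=None keeps every token; a list filters those tokens out.
    vectorizer = CountVectorizer(stop_words=stop_words)
    X_train = vectorizer.fit_transform(train_texts)  # fit on training data only
    X_val = vectorizer.transform(val_texts)
    model = MultinomialNB().fit(X_train, y_train)
    return {
        "vocab_size": len(vectorizer.vocabulary_),
        "train_acc": accuracy_score(y_train, model.predict(X_train)) * 100,
        "val_acc": accuracy_score(y_val, model.predict(X_val)) * 100,
    }


baseline = evaluate(stop_words=None)
filtered = evaluate(stop_words=DEMO_STOP_WORDS)
print("baseline:", baseline)
print("filtered:", filtered)
```

On a dataset this small the accuracy numbers are not meaningful; the point is the structure of the comparison, which carries over unchanged to a real corpus.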
Bonus Experiment
Try using TF-IDF vectorization instead of simple count vectors after stopword removal to see if accuracy improves further.
💡 Hint
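Use scikit-learn's TfidfVectorizer with the stop_words='english' parameter to combine stopword removal and TF-IDF in one step. This option uses scikit-learn's built-in English list, which differs from NLTK's.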
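A minimal sketch of the bonus experiment, assuming the same five sample sentences as the solution. It chains `TfidfVectorizer(stop_words="english")` and `MultinomialNB` with `make_pipeline`, so filtering and weighting happen in a single step. The `new_text` sentence is a made-up input for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "This is a good book",
    "I love reading this book",
    "This book is not good",
    "I do not like this book",
    "Reading is fun and good for you",
]
labels = [1, 1, 0, 0, 1]  # 1=positive, 0=negative

# stop_words="english" selects scikit-learn's built-in list, not NLTK's.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(texts, labels)

# make_pipeline names each step after its lowercased class name.
vocabulary = sorted(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
print("Vocabulary after filtering:", vocabulary)

new_text = "a fun book to read"  # hypothetical input
print("Predicted label:", pipeline.predict([new_text])[0])
```

Inspecting the vocabulary is a quick way to see which words the built-in list removed before comparing accuracy against the count-vector version.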

Practice

1. What is the main purpose of stopword removal in natural language processing?
easy
A. To correct spelling mistakes in text
B. To translate text into another language
C. To count the number of words in a sentence
D. To remove common words that do not add much meaning

Solution

  1. Step 1: Understand what stopwords are

    Stopwords are common words like 'the', 'is', 'and' that usually don't add important meaning.
  2. Step 2: Identify the purpose of removing stopwords

    Removing these words helps focus on meaningful words for better analysis.
  3. Final Answer:

    To remove common words that do not add much meaning -> Option D
  4. Quick Check:

    Stopword removal = Remove common meaningless words [OK]
Hint: Stopwords are common filler words removed to focus on meaning [OK]
Common Mistakes:
  • Thinking stopword removal translates text
  • Confusing stopword removal with spell checking
  • Believing it counts words instead of removing them
2. Which of the following Python code snippets correctly removes stopwords from a list of words using NLTK?
easy
A. filtered_words = [w for w in words if w not in stopwords.words('english')]
B. filtered_words = [w for w in words if w in stopwords.words('english')]
C. filtered_words = stopwords.remove(words)
D. filtered_words = words.remove(stopwords.words('english'))

Solution

  1. Step 1: Understand NLTK stopword removal syntax

    We keep words that are NOT in the stopwords list using a list comprehension.
  2. Step 2: Check each option

    Option A keeps only the words that are not in the stopwords list, which is correct. Option B keeps only the stopwords, the opposite of what is wanted. Option C calls stopwords.remove(), which does not exist. Option D misuses list.remove(), which removes a single matching element in place and returns None.
  3. Final Answer:

    filtered_words = [w for w in words if w not in stopwords.words('english')] -> Option A
  4. Quick Check:

    Keep words not in the stopwords list = Option A [OK]
Hint: Filter words not in stopwords list using list comprehension [OK]
Common Mistakes:
  • Using 'in' instead of 'not in' to filter stopwords
  • Calling non-existent methods like stopwords.remove()
  • Confusing filtering logic to keep stopwords instead of removing
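The comprehension in Option A is correct, but it evaluates `stopwords.words('english')` again for every word and then searches a list. A common refinement, sketched below, is to build a set once and test membership against it. The `remove_stopwords` helper and `demo_stop_list` are hypothetical; the short list stands in for `stopwords.words('english')` so the example runs without the NLTK corpus.

```python
def remove_stopwords(words, stop_list):
    """Return the words that do not appear in stop_list."""
    stop_set = set(stop_list)  # built once; set lookups are O(1) on average
    return [w for w in words if w not in stop_set]


# Hypothetical stand-in for stopwords.words('english').
demo_stop_list = ["this", "is", "a", "the", "and"]

words = ["this", "is", "a", "test"]
print(remove_stopwords(words, demo_stop_list))  # ['test']
```

With NLTK installed and the corpus downloaded, the same helper can be called as `remove_stopwords(words, stopwords.words('english'))`.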
3. Given the code below, what is the output?
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
words = ['this', 'is', 'a', 'test']
filtered = [w for w in words if w not in stopwords.words('english')]
print(filtered)
medium
A. ['this', 'test']
B. ['this', 'is', 'a', 'test']
C. ['test']
D. []

Solution

  1. Step 1: Identify stopwords in the list

    Stopwords in English include 'this', 'is', 'a'. 'test' is not a stopword.
  2. Step 2: Filter out stopwords

    The list comprehension removes 'this', 'is', 'a', leaving only 'test'.
  3. Final Answer:

    ['test'] -> Option C
  4. Quick Check:

    Only non-stopword 'test' remains [OK]
Hint: Remove common words; only meaningful words remain [OK]
Common Mistakes:
  • Assuming all words remain after removal
  • Forgetting to download stopwords corpus
  • Confusing which words are stopwords
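The result in question 3 depends on the words already being lowercase, because the entries in NLTK's English list are lowercase. The sketch below shows the difference with capitalized tokens. `demo_stop_words` is a hypothetical three-word stand-in for the NLTK list.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
demo_stop_words = {"this", "is", "a"}

words = ["This", "is", "a", "Test"]

# Exact-match filtering misses 'This' because of the capital letter.
case_sensitive = [w for w in words if w not in demo_stop_words]

# Lowercasing only for the comparison keeps the original casing in the output.
case_insensitive = [w for w in words if w.lower() not in demo_stop_words]

print(case_sensitive)    # ['This', 'Test']
print(case_insensitive)  # ['Test']
```

The solution code earlier takes the simpler route of calling `text.lower()` before splitting, which also works but discards the original casing.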
4. The following code is intended to remove stopwords from a list of words, but it raises an error. What is the problem?
from nltk.corpus import stopwords
words = ['hello', 'world']
filtered = [w for w in words if w not in stopwords('english')]
print(filtered)
medium
A. stopwords is not a function; should use stopwords.words('english')
B. The list comprehension syntax is incorrect
C. The variable 'words' is not defined
D. The print statement is missing parentheses

Solution

  1. Step 1: Check how stopwords are accessed

    stopwords is a corpus reader object, not a function. Its words('english') method returns the list of stopwords.
  2. Step 2: Identify the error in code

    The code calls stopwords('english'), which raises a TypeError because the object is not callable.
  3. Final Answer:

    stopwords is not a function; should use stopwords.words('english') -> Option A
  4. Quick Check:

    Use stopwords.words('english') to get stopwords list [OK]
Hint: Use stopwords.words('english'), not stopwords('english') [OK]
Common Mistakes:
  • Calling stopwords as a function instead of accessing .words()
  • Misunderstanding list comprehension syntax
  • Assuming print needs no parentheses in Python 3
5. You want to remove stopwords from a text but keep the word 'not' because it changes meaning. How can you modify the stopword list in NLTK to do this?
hard
A. Add 'not' to the stopwords list before filtering
B. Remove 'not' from the stopwords list before filtering
C. Replace 'not' with a synonym before filtering
D. Ignore stopword removal and keep all words

Solution

  1. Step 1: Understand default stopwords list

    NLTK's stopwords list includes 'not', which would be removed by default.
  2. Step 2: Modify stopwords list to keep 'not'

    Remove 'not' from the stopwords list before filtering to keep it in the text.
  3. Final Answer:

    Remove 'not' from the stopwords list before filtering -> Option B
  4. Quick Check:

    Modify stopwords list to keep important words [OK]
Hint: Delete 'not' from stopwords list to keep it in text [OK]
Common Mistakes:
  • Adding 'not' to stopwords instead of removing
  • Replacing words instead of modifying stopwords
  • Skipping stopword removal entirely
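The approach in question 5 can be sketched with a set difference. The `customize_stop_words` helper and `demo_base` list are hypothetical names for illustration; `demo_base` stands in for `stopwords.words('english')` so the example runs without the NLTK corpus.

```python
def customize_stop_words(base_list, keep):
    """Return the stopwords as a set, minus any words that must be kept."""
    return set(base_list) - set(keep)


# Hypothetical stand-in for stopwords.words('english').
demo_base = ["this", "is", "not", "a", "do", "i"]

tokens = "I do not like this book".lower().split()

default_stop_words = set(demo_base)
custom_stop_words = customize_stop_words(demo_base, keep={"not"})

print([t for t in tokens if t not in default_stop_words])  # ['like', 'book']
print([t for t in tokens if t not in custom_stop_words])   # ['not', 'like', 'book']
```

With the default list the negation disappears and the sentence looks positive; keeping 'not' preserves the signal that matters for sentiment.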