
Spam detection pipeline in NLP - ML Experiment: Train & Evaluate

Experiment - Spam detection pipeline
Problem: Build a model to classify text messages as spam or not spam.
Current Metrics: Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.45
Issue: The model overfits: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You can only modify the model architecture and training hyperparameters.
Do not change the dataset or preprocessing steps.
Solution
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Sample data (replace with actual dataset loading)
texts = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.",
         "Nah I don't think he goes to usf, he lives around here though",
         "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward!",
         "Had your mobile 11 months or more? You are entitled to update to the latest colour mobiles with camera for free!",
         "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight"]
labels = [1, 0, 1, 1, 0]  # 1 = spam, 0 = not spam

# Vectorize text data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()
y = np.array(labels)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=42)

# Build model with dropout and reduced neurons
model = Sequential([
    Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.5),
    Dense(8, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train model
history = model.fit(X_train, y_train, epochs=50, batch_size=2, validation_data=(X_val, y_val), callbacks=[early_stop], verbose=0)

# Evaluate
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)

print(f"Training accuracy: {train_acc*100:.2f}%, Validation accuracy: {val_acc*100:.2f}%")
print(f"Training loss: {train_loss:.4f}, Validation loss: {val_loss:.4f}")
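Added Dropout layers with a 50% rate after each hidden layer to reduce overfitting.
Reduced the hidden layers to 16 and 8 neurons (down from larger layers) to lower model capacity.
Added EarlyStopping with patience=5 to stop training when validation loss stops improving and to restore the best weights.
Set the learning rate explicitly to 0.001 (the Keras default for Adam) for stable training.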
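The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the function name `early_stopping_epoch` and the validation-loss values are hypothetical, and only the "stop after `patience` epochs without improvement" logic is modeled.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once `patience` consecutive epochs pass
    without the validation loss improving on its best value.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through every epoch without exhausting patience
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: they improve until epoch 2, then get worse
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.50, 0.52, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch, best_epoch)  # 7 2
```

With restore_best_weights=True, the model kept after stopping is the one from the best epoch, not the one from the last epoch trained.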
Results Interpretation

Before: Training accuracy 98%, Validation accuracy 75%, Training loss 0.05, Validation loss 0.45

After: Training accuracy 90%, Validation accuracy 87%, Training loss 0.25, Validation loss 0.30

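Adding dropout and early stopping helps reduce overfitting. This improves validation accuracy by making the model generalize better to new data. The five-message sample in the code is only a placeholder: with two validation messages, validation accuracy can only be 0%, 50%, or 100%, so reproducing figures like these requires loading a real dataset.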
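The before and after figures can be checked against the task targets with a little arithmetic. The sketch below only uses the numbers quoted above; the helper names `generalization_gap` and `meets_target` are made up for illustration.

```python
# Accuracy figures quoted in the results (percent)
before = {"train_acc": 98, "val_acc": 75}
after = {"train_acc": 90, "val_acc": 87}


def generalization_gap(metrics):
    """Training accuracy minus validation accuracy, in percentage points."""
    return metrics["train_acc"] - metrics["val_acc"]


def meets_target(metrics):
    """Task target: validation accuracy >= 85% and training accuracy < 92%."""
    return metrics["val_acc"] >= 85 and metrics["train_acc"] < 92


print(generalization_gap(before), meets_target(before))  # 23 False
print(generalization_gap(after), meets_target(after))    # 3 True
```

A smaller gap between training and validation accuracy is the usual sign that overfitting has been reduced.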
Bonus Experiment
Try using a different text representation like TF-IDF instead of simple counts and see if validation accuracy improves further.
Hint: Use scikit-learn's TfidfVectorizer instead of CountVectorizer and keep the rest of the pipeline the same.
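As a rough sketch of the swap, the snippet below vectorizes two of the sample messages both ways. It assumes the default settings of both vectorizers; whether TF-IDF actually improves validation accuracy depends on the dataset and is not shown here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.",
    "Nah I don't think he goes to usf, he lives around here though",
]

# Same tokenization in both cases, so the feature matrices have the same shape
X_counts = CountVectorizer().fit_transform(texts).toarray()
X_tfidf = TfidfVectorizer().fit_transform(texts).toarray()

print(X_counts.shape == X_tfidf.shape)

# Counts are integers; TF-IDF values are floats, and with the default
# norm='l2' each row is scaled to unit length
row_norms = np.linalg.norm(X_tfidf, axis=1)
print(X_counts.dtype, X_tfidf.dtype, row_norms)
```

Because the output shape is unchanged, the model definition that reads X_train.shape[1] does not need to be edited after the swap.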

Practice

1. What is the main purpose of a spam detection pipeline in NLP?
easy
A. To convert text messages into numbers and train a model to identify spam
B. To translate messages into different languages
C. To summarize long emails automatically
D. To generate new text messages based on spam examples

Solution

  1. Step 1: Understand the role of a spam detection pipeline

    A spam detection pipeline processes text data to prepare it for a machine learning model that can classify messages as spam or not spam.
  2. Step 2: Identify the key function

    The pipeline converts text into numbers (features) and trains a model to spot spam messages automatically.
  3. Final Answer:

    To convert text messages into numbers and train a model to identify spam -> Option A
  4. Quick Check:

    Spam detection pipeline = convert text + train model [OK]
Hint: Spam detection means turning text into numbers to train a model [OK]
Common Mistakes:
  • Thinking it translates or summarizes text
  • Confusing spam detection with text generation
  • Ignoring the conversion of text to numbers
2. Which of the following code snippets correctly creates a simple spam detection pipeline using scikit-learn's Pipeline with a TfidfVectorizer and a LogisticRegression model?
easy
A. Pipeline([('vectorizer', TfidfVectorizer()), ('model', LogisticRegression())])
B. Pipeline(('vectorizer', TfidfVectorizer()), ('model', LogisticRegression()))
C. Pipeline({'vectorizer': TfidfVectorizer(), 'model': LogisticRegression()})
D. Pipeline(['vectorizer' = TfidfVectorizer(), 'model' = LogisticRegression()])

Solution

  1. Step 1: Recall the correct syntax for scikit-learn Pipeline

    The Pipeline constructor expects a list of tuples, each tuple containing a name and a transformer or estimator.
  2. Step 2: Check each option's syntax

    Pipeline([('vectorizer', TfidfVectorizer()), ('model', LogisticRegression())]) uses a list of tuples correctly. Other options use incorrect syntax like using '=' inside lists, passing tuples as separate arguments, or dictionary syntax.
  3. Final Answer:

    Pipeline([('vectorizer', TfidfVectorizer()), ('model', LogisticRegression())]) -> Option A
  4. Quick Check:

    Pipeline syntax = list of (name, step) tuples [OK]
Hint: Pipeline needs a list of (name, step) tuples inside brackets [OK]
Common Mistakes:
  • Using parentheses instead of brackets for the list
  • Using dictionary syntax inside Pipeline
  • Assigning steps with '=' inside a list
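The list-of-tuples syntax from Option A can be exercised end to end as follows. The training messages and labels are hypothetical toy data, far too small for a meaningful model, so the printed predictions should not be read as a quality claim.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = not spam
train_messages = [
    "Win a free prize now",
    "Claim your free reward today",
    "Meeting at noon",
    "See you at lunch tomorrow",
]
train_labels = [1, 1, 0, 0]

# A list of (name, step) tuples, as in Option A
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("model", LogisticRegression()),
])
pipeline.fit(train_messages, train_labels)

# Step names are available in the order they were given
step_names = list(pipeline.named_steps)
print(step_names)  # ['vectorizer', 'model']

# predict accepts raw text because the vectorizer runs inside the pipeline
predictions = pipeline.predict(["Free prize waiting", "Lunch at noon"])
print(predictions)
```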
3. Given the following code, what will be the output of print(predictions) if the input messages are ["Win a free prize now", "Meeting at noon"] and the model predicts 1 for spam and 0 for not spam?
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('model', LogisticRegression())
])

# Assume pipeline is already trained
messages = ["Win a free prize now", "Meeting at noon"]
predictions = pipeline.predict(messages)
print(predictions)
medium
A. [0 1]
B. [1 0]
C. [1 1]
D. [0 0]

Solution

  1. Step 1: Understand the input and model output

    The input has one spam-like message "Win a free prize now" and one normal message "Meeting at noon". The model labels spam as 1 and not spam as 0.
  2. Step 2: Predict expected labels

    The first message is likely spam, so its prediction is 1. The second is normal, so its prediction is 0. predict returns a NumPy array, which prints as [1 0] without a comma.
  3. Final Answer:

    [1 0] -> Option B
  4. Quick Check:

    Spam message = 1, normal message = 0 [OK]
Hint: Spam message predicts 1, normal message predicts 0 [OK]
Common Mistakes:
  • Swapping labels 0 and 1
  • Assuming both messages are spam
  • Confusing output format with list of strings
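The answer options are written without commas because predict returns a NumPy array rather than a Python list. The sketch below builds the array by hand to show only the printing behavior; it does not run a trained model.

```python
import numpy as np

# Stand-in for the array a trained pipeline would return for the two messages
predictions = np.array([1, 0])

print(predictions)           # [1 0]    (NumPy array: no comma)
print(predictions.tolist())  # [1, 0]   (plain Python list: comma)
```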
4. Identify the error in this spam detection pipeline code and choose the correct fix:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', CountVectorizer),
    ('model', LogisticRegression())
])

pipeline.fit(train_messages, train_labels)
medium
A. Add parentheses to pipeline.fit() call
B. Replace LogisticRegression() with LogisticRegression
C. Remove the pipeline and train model directly
D. Change CountVectorizer to CountVectorizer() to create an instance

Solution

  1. Step 1: Check the pipeline steps for correct instantiation

    CountVectorizer is a class and must be instantiated with parentheses to create an object.
  2. Step 2: Identify the error and fix

    The code passes the CountVectorizer class itself rather than an instance, so pipeline.fit fails. Adding parentheses creates the instance and fixes it.
  3. Final Answer:

    Change CountVectorizer to CountVectorizer() to create an instance -> Option D
  4. Quick Check:

    Instantiate classes with () in pipeline steps [OK]
Hint: Always instantiate transformers with () in pipeline steps [OK]
Common Mistakes:
  • Forgetting parentheses after class names
  • Confusing model and vectorizer instantiation
  • Trying to remove pipeline instead of fixing syntax
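The failure and the fix can be reproduced side by side. The training data here is hypothetical, and the exact exception type and message depend on the installed scikit-learn version, so the sketch catches a generic Exception and only reports its name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = not spam
train_messages = [
    "Win a free prize now",
    "Claim your free reward today",
    "Meeting at noon",
    "See you at lunch tomorrow",
]
train_labels = [1, 1, 0, 0]

# Broken: the class is passed, not an instance
broken = Pipeline([
    ("vectorizer", CountVectorizer),
    ("model", LogisticRegression()),
])
error_name = None
try:
    broken.fit(train_messages, train_labels)
except Exception as exc:  # exception type varies across versions
    error_name = type(exc).__name__
print("Broken pipeline raised:", error_name)

# Fixed: CountVectorizer() creates the instance the pipeline needs
fixed = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("model", LogisticRegression()),
])
fixed.fit(train_messages, train_labels)
fixed_prediction = fixed.predict(["Free prize"])
print(fixed_prediction)
```

Note that the error appears when fit is called, not when the Pipeline is constructed.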
5. You want to improve your spam detection pipeline by adding a step to remove common stop words before vectorizing. Which pipeline modification correctly adds this step using CountVectorizer with stop words removal?
hard
A. Pipeline([('stopwords', StopWordsRemover()), ('vectorizer', CountVectorizer()), ('model', LogisticRegression())])
B. Pipeline([('vectorizer', CountVectorizer()), ('stopwords', StopWordsRemover()), ('model', LogisticRegression())])
C. Pipeline([('vectorizer', CountVectorizer(stop_words='english')), ('model', LogisticRegression())])
D. Pipeline([('vectorizer', CountVectorizer(stop_words=None)), ('model', LogisticRegression())])

Solution

  1. Step 1: Understand how to remove stop words in CountVectorizer

    CountVectorizer has a parameter stop_words which can be set to 'english' to remove common English stop words automatically.
  2. Step 2: Check pipeline options for correct usage

    Pipeline([('vectorizer', CountVectorizer(stop_words='english')), ('model', LogisticRegression())]) correctly sets stop_words='english' inside CountVectorizer. The other options either add a StopWordsRemover step, which scikit-learn does not provide, or set stop_words=None, which disables removal.
  3. Final Answer:

    Pipeline([('vectorizer', CountVectorizer(stop_words='english')), ('model', LogisticRegression())]) -> Option C
  4. Quick Check:

    Use stop_words='english' in CountVectorizer to remove stop words [OK]
Hint: Use stop_words='english' inside CountVectorizer [OK]
Common Mistakes:
  • Trying to add a separate stop words remover step
  • Setting stop_words=None, which disables removal
  • Misplacing stop words removal after vectorizing