Spam detection pipeline in NLP - ML Experiment: Train & Evaluate

Experiment - Spam detection pipeline

Problem: Build a model to classify text messages as spam or not spam.

Current Metrics: Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.45

Issue: The model overfits: training accuracy is very high but validation accuracy is much lower.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.

You can only modify the model architecture and training hyperparameters.

Do not change the dataset or preprocessing steps.

Solution

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Sample data (replace with actual dataset loading)
texts = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.",
         "Nah I don't think he goes to usf, he lives around here though",
         "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward!",
         "Had your mobile 11 months or more? You are entitled to update to the latest colour mobiles with camera for free!",
         "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight"]
labels = [1, 0, 1, 1, 0]  # 1 = spam, 0 = not spam

# Vectorize text data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()
y = np.array(labels)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=42)

# Build model with dropout and reduced neurons
model = Sequential([
    Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.5),
    Dense(8, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train model
history = model.fit(X_train, y_train, epochs=50, batch_size=2, validation_data=(X_val, y_val), callbacks=[early_stop], verbose=0)

# Evaluate
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)

print(f"Training accuracy: {train_acc*100:.2f}%, Validation accuracy: {val_acc*100:.2f}%")
print(f"Training loss: {train_loss:.4f}, Validation loss: {val_loss:.4f}")

Added Dropout layers with a 50% rate after each hidden layer to reduce overfitting.

Reduced the hidden layers to 16 and 8 neurons to lower the model's capacity.

Added EarlyStopping to stop training when validation loss stops improving and to restore the best weights.

Set the learning rate explicitly to 0.001 (the Keras default for Adam) for stable training.
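
As a rough illustration of what a Dropout layer with a 50% rate does, the sketch below applies inverted dropout to a list of activations in plain Python. The function name `inverted_dropout` and the sample values are made up for this example. Keras's own layer works on tensors, but it follows the same idea: zero units at random during training, scale the survivors by 1 / (1 - rate), and leave inputs unchanged at inference.

```python
import random

def inverted_dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training and
    scale the kept ones by 1 / (1 - rate); pass inputs through otherwise."""
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0] * 8      # hypothetical hidden-layer activations

train_out = inverted_dropout(hidden, rate=0.5, training=True, rng=rng)
eval_out = inverted_dropout(hidden, rate=0.5, training=False, rng=rng)

print(train_out)  # each value is either 0.0 or 2.0
print(eval_out)   # unchanged: eight 1.0 values
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which is why dropout tends to narrow the gap between training and validation accuracy.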

Results Interpretation

Before: Training accuracy 98%, Validation accuracy 75%, Training loss 0.05, Validation loss 0.45

After: Training accuracy 90%, Validation accuracy 87%, Training loss 0.25, Validation loss 0.30
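
To make the comparison concrete, the short sketch below computes the train/validation gap for both runs and checks them against the task target (validation accuracy of at least 85%, training accuracy below 92%). The helper name `summarize` is hypothetical, and the accuracies are the figures quoted above, written as whole percentages.

```python
def summarize(name, train_acc, val_acc, min_val=85, max_train=92):
    """Return the train/validation gap in percentage points and whether
    the task target (val >= min_val, train < max_train) is met."""
    gap = train_acc - val_acc
    meets_target = val_acc >= min_val and train_acc < max_train
    return {"run": name, "gap": gap, "meets_target": meets_target}

before = summarize("before", train_acc=98, val_acc=75)
after = summarize("after", train_acc=90, val_acc=87)

print(before)  # gap of 23 points, target not met
print(after)   # gap of 3 points, target met
```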

Dropout and early stopping reduce overfitting: the gap between training and validation accuracy narrows from 23 points to 3, and validation accuracy rises from 75% to 87%, which indicates that the model generalizes better to new data.
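
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified model of the behaviour, not the Keras implementation: it ignores options such as `min_delta`, the function name `early_stopping_epoch` is hypothetical, and the validation losses are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs pass without a new
    lowest validation loss. With restore_best_weights=True, the weights
    kept would be those from `best_epoch`.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran for every epoch.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving, then drifting upward.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.50, 0.52, 0.55]
print(early_stopping_epoch(losses, patience=5))  # (7, 2)
```

Here the lowest loss occurs at epoch 2, five epochs pass without improvement, and training halts at epoch 7 instead of running all 50 epochs.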

Bonus Experiment

Try using a different text representation like TF-IDF instead of simple counts and see if validation accuracy improves further.

💡 Hint

Use sklearn's TfidfVectorizer instead of CountVectorizer and keep the rest of the pipeline the same.
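
In the pipeline itself, the change amounts to importing `TfidfVectorizer` from `sklearn.feature_extraction.text` and using it where `CountVectorizer()` is constructed. To show what that reweighting does, here is a dependency-free sketch of TF-IDF using the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which scikit-learn documents as its defaults. The tokenizer is a simplification, and the function names and sample messages are assumptions made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical messages, not taken from the dataset.
docs = ["free prize free entry", "see you at home", "free entry to win"]
vocab, rows = tfidf_vectors(docs)

first = dict(zip(vocab, rows[0]))
# "prize" appears in one message and "entry" in two, so "prize" gets
# the larger weight even though both occur once in the first message.
print(round(first["prize"], 3), round(first["entry"], 3))
```

Terms that occur in many messages are down-weighted relative to rarer ones, which is the property the bonus experiment asks you to test against plain counts.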

Practice

1. What is the main purpose of a spam detection pipeline in NLP? (easy)

A. To convert text messages into numbers and train a model to identify spam

B. To translate messages into different languages

C. To summarize long emails automatically

D. To generate new text messages based on spam examples

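Solution

Step 1: Understand the role of a spam detection pipeline. It turns raw text messages into numeric features and feeds them to a classifier that labels each message as spam or not spam.

Step 2: Identify the key function. Translation, summarization, and text generation are different NLP tasks, so only option A describes spam detection.

Final Answer: A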
