
Spam detection pipeline in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Model Choice
intermediate
Choosing the best model for spam detection

You want to build a spam detection system that classifies emails as spam or not spam. Which model is best suited for this binary text classification task?

A. A linear Support Vector Machine (SVM) with TF-IDF features
B. K-Means clustering with raw email text
C. A Convolutional Neural Network (CNN) designed for image recognition
D. Principal Component Analysis (PCA) for dimensionality reduction
💡 Hint

Think about models that work well with text features and binary classification.
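
To make the hint concrete, here is a minimal sketch of a text-feature model for binary classification, assuming scikit-learn is installed. The four toy emails, their labels, and the name `spam_clf` are made up for illustration; this is one plausible wiring, not the only one.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data invented for illustration: 1 = spam, 0 = not spam.
emails = [
    "Win free money now",
    "Claim your free prize today",
    "Meeting moved to Monday",
    "Lunch with the team tomorrow",
]
labels = [1, 1, 0, 0]

# TF-IDF turns each email into a sparse, weighted word vector;
# the linear SVM then learns a separating hyperplane over those vectors.
spam_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
spam_clf.fit(emails, labels)

predictions = spam_clf.predict(["Free prize money", "Team meeting on Monday"])
print(predictions)
```

With so few training examples the predictions are only a smoke test; a real evaluation would need a held-out test set.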

Predict Output
intermediate
Output of text preprocessing step

Given this Python code that preprocesses email text, what is the output?

import re
text = "Hello!!! This is spam??? Visit http://spam.com now."
cleaned = re.sub(r'http\S+', '', text)
cleaned = re.sub(r'[^a-zA-Z ]', '', cleaned).lower().split()
print(cleaned)
A. ['hello', 'this', 'is', 'spam', 'visit', 'now']
B. ['hello!!!', 'this', 'is', 'spam???', 'visit', 'http://spam.com', 'now']
C. ['hello', 'this', 'is', 'spam', 'visit', 'httpspamcom', 'now']
D. ['hello', 'this', 'is', 'spam', 'visit', 'http', 'spamcom', 'now']
💡 Hint

Look at how URLs and punctuation are removed, and text is lowercased and split.
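
The same steps can be wrapped in a reusable function. This is a sketch on a different, made-up input string; the helper name `clean_email` is hypothetical and not part of the exercise.

```python
import re

def clean_email(text):
    """Strip URLs, drop non-letter characters, lowercase, and tokenize."""
    no_urls = re.sub(r"http\S+", "", text)             # remove http/https links
    letters_only = re.sub(r"[^a-zA-Z ]", "", no_urls)  # keep only letters and spaces
    return letters_only.lower().split()                # split() collapses repeated spaces

tokens = clean_email("WIN $$$ now!!! Click https://example.com/offer today")
print(tokens)  # ['win', 'now', 'click', 'today']
```

The order of the two substitutions matters: the URL has to be removed while its `:` and `/` characters are still present for `http\S+` to match the whole link.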

Hyperparameter
advanced
Choosing the best hyperparameter for TF-IDF vectorizer

In a spam detection pipeline, you use a TF-IDF vectorizer. Which max_features value is best to balance performance and speed on a large email dataset?

A. max_features=10
B. max_features=1000000
C. max_features=None
D. max_features=10000
💡 Hint

Too few features may miss important words; too many may slow training.
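
The trade-off in the hint can be seen in a small sketch. scikit-learn documents `max_features` as keeping only the top terms ordered by term frequency across the corpus; the pure-Python helper `cap_vocabulary` below is a hypothetical stand-in that mimics that idea on a made-up corpus, not the library's implementation.

```python
from collections import Counter

def cap_vocabulary(docs, max_features=None):
    """Return the most frequent terms, at most max_features of them."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    # most_common(None) returns every term, mirroring max_features=None.
    ranked = [term for term, _ in counts.most_common(max_features)]
    return sorted(ranked)

# Toy corpus invented for illustration.
corpus = [
    "free money free prize",
    "free money now",
    "hello friend",
]

small_vocab = cap_vocabulary(corpus, max_features=2)
full_vocab = cap_vocabulary(corpus)
print(small_vocab)      # ['free', 'money']
print(len(full_vocab))  # 6
```

A very small cap discards words such as "prize" that may carry signal, while no cap keeps every rare token and grows the feature matrix accordingly.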

Metrics
advanced
Interpreting spam detection model metrics

Your spam detection model has these results on test data: 90% accuracy, 70% precision, 95% recall. What does this mean?

A. The model correctly identifies most spam emails but also marks many non-spam as spam.
B. The model rarely misses spam emails but sometimes wrongly flags non-spam as spam.
C. The model is very precise but misses many spam emails.
D. The model has balanced precision and recall, so it is perfect.
💡 Hint

Recall is about catching spam; precision is about avoiding false alarms.
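
The definitions in the hint can be worked through with confusion-matrix counts. The counts below are hypothetical, chosen only so that the ratios match the figures in the question; they are not taken from any real model.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of emails flagged as spam, how many were spam
    recall = tp / (tp + fn)     # of actual spam emails, how many were caught
    return accuracy, precision, recall

# Hypothetical counts chosen only so the ratios match the question.
accuracy, precision, recall = classification_metrics(tp=133, fp=57, fn=7, tn=443)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.90 precision=0.70 recall=0.95
```

In this illustration, `fn` counts the spam emails that slipped through and `fp` counts the legitimate emails that were flagged.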

🔧 Debug
expert
Debugging model training failure in spam detection

Why does this spam detection training code raise a ValueError?

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
texts = ["Free money now", "Hello friend", "Win a prize"]
labels = [1, 0]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)
A. fit_transform returns a dense matrix, but LogisticRegression needs sparse
B. LogisticRegression requires labels to be strings, not integers
C. The labels list length does not match the number of texts
D. TfidfVectorizer cannot process short texts
💡 Hint

Check if the number of labels matches the number of input samples.
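
The check suggested by the hint can be made explicit before any training call. This is a minimal sketch; `check_aligned` is a hypothetical helper, not a scikit-learn function.

```python
def check_aligned(texts, labels):
    """Raise ValueError early if texts and labels differ in length."""
    if len(texts) != len(labels):
        raise ValueError(
            f"Got {len(texts)} texts but {len(labels)} labels; "
            "each text needs exactly one label."
        )

texts = ["Free money now", "Hello friend", "Win a prize"]
labels = [1, 0]

try:
    check_aligned(texts, labels)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

Running a guard like this before vectorizing reports the mismatch in terms of the input data rather than the shape of the feature matrix.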