Recall & Review
beginner
What is the main goal of a spam detection pipeline?
To automatically identify and filter out unwanted or harmful messages (spam) from legitimate messages.
beginner
Name the typical steps in a spam detection pipeline.
1. Data collection 2. Text preprocessing 3. Feature extraction 4. Model training 5. Model evaluation 6. Prediction and filtering
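The six steps above can be sketched end to end with scikit-learn. This is only a minimal illustration under stated assumptions: the twelve messages and their labels are made up for the example, and TfidfVectorizer with LogisticRegression is just one reasonable choice of feature extractor and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# 1. Data collection: a tiny hand-made dataset (hypothetical; 1 = spam, 0 = not spam).
messages = [
    "Win a free prize now",
    "Claim your free cash reward today",
    "Free entry to win cash",
    "Urgent: you won a prize, claim now",
    "Cheap loans, apply now for free",
    "Win cash now, limited offer",
    "Meeting at noon",
    "Can we reschedule the meeting to Friday",
    "Lunch at noon tomorrow?",
    "Please review the attached report",
    "See you at the office on Monday",
    "The project report is due Friday",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out a quarter of the data so evaluation uses unseen messages.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=0
)

# 2-4. Preprocessing and feature extraction (inside the vectorizer), then model training.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# 5. Model evaluation on the held-out messages.
test_f1 = f1_score(y_test, pipeline.predict(X_test), zero_division=0)
print(f"F1 on held-out messages: {test_f1:.2f}")

# 6. Prediction and filtering: keep only messages predicted as not spam.
incoming = ["Claim your free prize", "Project meeting on Friday"]
inbox = [msg for msg, pred in zip(incoming, pipeline.predict(incoming)) if pred == 0]
print(inbox)
```

With so little data the score is not meaningful; the point is only to show where each step lives in the code.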
beginner
Why is text preprocessing important in spam detection?
It cleans and simplifies the text by converting it to lowercase and removing noise such as punctuation and stop words, which makes it easier for the model to learn patterns.
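As a rough illustration of these cleaning steps, here is a small sketch in plain Python. The preprocess function and the short STOP_WORDS set are hypothetical names invented for this example; real stop-word lists are much longer.

```python
import string

# A tiny illustrative stop-word list (an assumption for this example only).
STOP_WORDS = {"a", "an", "the", "and", "is", "to", "you", "at"}


def preprocess(message: str) -> list[str]:
    """Lowercase the text, strip punctuation, split on whitespace, drop stop words."""
    lowered = message.lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punctuation.split() if token not in STOP_WORDS]


tokens = preprocess("WIN a FREE prize, now!!!")
print(tokens)  # ['win', 'free', 'prize', 'now']
```

In scikit-learn, much of this work is done by the vectorizer itself through options such as lowercase and stop_words, so a separate function like this is often unnecessary.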
beginner
What is feature extraction in the context of spam detection?
It is the process of converting text messages into numerical data (features) that a machine learning model can understand, such as word counts or TF-IDF scores.
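To make the idea of word counts concrete, the sketch below builds count vectors by hand in plain Python. The two messages and the to_count_vector helper are made up for illustration; scikit-learn's CountVectorizer performs the same kind of transformation with more options.

```python
from collections import Counter

messages = ["free prize free entry", "meeting at noon"]

# Build a fixed, sorted vocabulary from every word seen in the messages.
vocabulary = sorted({word for message in messages for word in message.split()})


def to_count_vector(message: str) -> list[int]:
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(message) for message in messages]
print(vocabulary)  # ['at', 'entry', 'free', 'meeting', 'noon', 'prize']
print(vectors)     # [[0, 1, 2, 0, 0, 1], [1, 0, 0, 1, 1, 0]]
```

Each message becomes a row of numbers with the same length, which is the form a model such as LogisticRegression expects.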
intermediate
Which metrics are commonly used to evaluate a spam detection model's performance?
Accuracy, precision, recall, and F1-score are commonly used. F1-score is especially useful because it balances precision and recall: it rewards catching spam (high recall) without raising too many false alarms (high precision).
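A small worked example may help show why accuracy alone can mislead when spam is rare. The labels below are invented for illustration, and precision_recall_f1 is a hypothetical helper written for this sketch; scikit-learn offers equivalent functions in sklearn.metrics.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Ten messages, only two of which are spam.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
never_spam = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # a model that never flags anything
some_spam = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]    # one hit, one miss, one false alarm

print(accuracy(y_true, never_spam), precision_recall_f1(y_true, never_spam))
# 0.8 (0.0, 0.0, 0.0)
print(accuracy(y_true, some_spam), precision_recall_f1(y_true, some_spam))
# 0.8 (0.5, 0.5, 0.5)
```

Both models reach the same accuracy, but only F1 reveals that the first one never detects any spam.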
What is the first step in a spam detection pipeline?
A. Feature extraction
B. Model training
C. Data collection
D. Prediction
Answer: C
The pipeline starts by collecting the data needed to train and test the model.
Which technique helps convert text into numbers for the model?
A. Text preprocessing
B. Feature extraction
C. Model evaluation
D. Data collection
Answer: B
Feature extraction transforms text into numerical features the model can use.
Why do we remove stop words during preprocessing?
A. They are common words that do not add useful information
B. They are rare words
C. They add important meaning
D. They are numbers
Answer: A
Stop words like 'the' or 'and' are very common and usually do not help the model learn.
Which metric balances false positives and false negatives in spam detection?
A. Accuracy
B. Precision
C. Recall
D. F1-score
Answer: D
F1-score combines precision and recall to give a balanced measure.
What does the model output in a spam detection pipeline?
A. Spam or not spam prediction
B. Numerical features
C. Cleaned text
D. Raw data
Answer: A
The model predicts whether a message is spam or not spam.
Describe the main steps involved in building a spam detection pipeline.
Think about how raw messages become predictions.
Explain why feature extraction is necessary for spam detection models.
Models cannot work directly with text.
Practice
1. What is the main purpose of a spam detection pipeline in NLP?
easy
A. To convert text messages into numbers and train a model to identify spam
B. To translate messages into different languages
C. To summarize long emails automatically
D. To generate new text messages based on spam examples
Solution
Step 1: Understand the role of a spam detection pipeline
A spam detection pipeline processes text data to prepare it for a machine learning model that can classify messages as spam or not spam.
Step 2: Identify the key function
The pipeline converts text into numbers (features) and trains a model to spot spam messages automatically.
Final Answer:
To convert text messages into numbers and train a model to identify spam -> Option A
Quick Check:
Spam detection pipeline = convert text + train model [OK]
Hint: Spam detection means turning text into numbers to train a model [OK]
Common Mistakes:
Thinking it translates or summarizes text
Confusing spam detection with text generation
Ignoring the conversion of text to numbers
2. Which of the following code snippets correctly creates a simple spam detection pipeline using scikit-learn's Pipeline with a TfidfVectorizer and a LogisticRegression model?
easy
A. Pipeline([('vectorizer', TfidfVectorizer()), ('model', LogisticRegression())])
B. Pipeline(('vectorizer', TfidfVectorizer()), ('model', LogisticRegression()))
C. Pipeline({'vectorizer': TfidfVectorizer(), 'model': LogisticRegression()})
D. Pipeline(['vectorizer' = TfidfVectorizer(), 'model' = LogisticRegression()])
Solution
Step 1: Recall the correct syntax for scikit-learn Pipeline
The Pipeline constructor expects a list of tuples, each tuple containing a name and a transformer or estimator.
Step 2: Check each option's syntax
Pipeline([('vectorizer', TfidfVectorizer()), ('model', LogisticRegression())]) uses a list of tuples correctly. Other options use incorrect syntax like using '=' inside lists, passing tuples as separate arguments, or dictionary syntax.
Final Answer:
Pipeline([('vectorizer', TfidfVectorizer()), ('model', LogisticRegression())]) -> Option A
Quick Check:
Pipeline syntax = list of (name, step) tuples [OK]
Hint: Pipeline needs a list of (name, step) tuples inside brackets [OK]
Common Mistakes:
Using parentheses instead of brackets for the list
Using dictionary syntax inside Pipeline
Assigning steps with '=' inside a list
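The list-of-tuples form from this question can be checked with a short sketch. The step names 'vectorizer' and 'model' are the ones used in the question; any other unique names would work the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Each step is a (name, estimator instance) tuple; the steps go in one list.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("model", LogisticRegression()),
])

# The names given in the tuples can be used to look the steps up later.
step_names = [name for name, _ in pipeline.steps]
print(step_names)  # ['vectorizer', 'model']

model = pipeline.named_steps["model"]
print(type(model).__name__)  # LogisticRegression
```

Naming the steps is what makes it possible to inspect or tune an individual step after the pipeline has been built.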
3. Given the following code, what will be the output of print(predictions) if the input messages are ["Win a free prize now", "Meeting at noon"] and the model predicts 1 for spam and 0 for not spam?
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('model', LogisticRegression())
])
# Assume pipeline is already trained
messages = ["Win a free prize now", "Meeting at noon"]
predictions = pipeline.predict(messages)
print(predictions)
medium
A. [0 1]
B. [1 0]
C. [1 1]
D. [0 0]
Solution
Step 1: Understand the input and model output
The input has one spam-like message "Win a free prize now" and one normal message "Meeting at noon". The model labels spam as 1 and not spam as 0.
Step 2: Predict expected labels
The first message is likely spam, so prediction is 1. The second is normal, so prediction is 0.
Final Answer:
[1 0] -> Option B
Quick Check:
Spam message = 1, normal message = 0 [OK]
Hint: Spam message predicts 1, normal message predicts 0 [OK]
Common Mistakes:
Swapping labels 0 and 1
Assuming both messages are spam
Confusing output format with list of strings
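The snippet in this question assumes a trained pipeline. A runnable variant is sketched below with a small invented training set (1 = spam, 0 = not spam). With data this small the predicted labels are not guaranteed, so no expected labels are shown; the sketch mainly illustrates the type and shape of what predict returns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data, made up for this example.
train_messages = [
    "Win a free prize now",
    "Free cash, claim your prize",
    "Win money now, free entry",
    "Meeting at noon",
    "Lunch meeting tomorrow at noon",
    "Please send the report before the meeting",
]
train_labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("model", LogisticRegression()),
])
pipeline.fit(train_messages, train_labels)

messages = ["Win a free prize now", "Meeting at noon"]
predictions = pipeline.predict(messages)

# predict returns a NumPy array with one label per input message.
print(type(predictions).__name__, predictions.shape)  # ndarray (2,)
print(predictions)

# Probability of the spam class (label 1) for each message.
spam_probability = pipeline.predict_proba(messages)[:, 1]
print(spam_probability)
```

Because predict returns a NumPy array, printing it shows the labels separated by spaces rather than commas, which is why the answer options are written as [1 0] and not [1, 0].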
4. Identify the error in this spam detection pipeline code and choose the correct fix:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('vectorizer', CountVectorizer),
    ('model', LogisticRegression())
])
pipeline.fit(train_messages, train_labels)
medium
A. Add parentheses to pipeline.fit() call
B. Replace LogisticRegression() with LogisticRegression
C. Remove the pipeline and train model directly
D. Change CountVectorizer to CountVectorizer() to create an instance
Solution
Step 1: Check the pipeline steps for correct instantiation
CountVectorizer is a class and must be instantiated with parentheses to create an object.
Step 2: Identify the error and fix
The code passes the CountVectorizer class itself instead of an instance, so pipeline.fit() fails with an error. Adding parentheses creates an instance and fixes it.
Final Answer:
Change CountVectorizer to CountVectorizer() to create an instance -> Option D
Quick Check:
Instantiate classes with () in pipeline steps [OK]
Hint: Always instantiate transformers with () in pipeline steps [OK]
Common Mistakes:
Forgetting parentheses after class names
Confusing model and vectorizer instantiation
Trying to remove pipeline instead of fixing syntax
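The difference between passing a class and passing an instance can be observed directly. In the sketch below, train_messages and train_labels are small invented placeholders, and the exact exception type raised for the broken version may vary between scikit-learn versions, so the code catches a general Exception and only reports its name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data for the demonstration (1 = spam, 0 = not spam).
train_messages = ["win a free prize", "free cash now", "meeting at noon", "see you at lunch"]
train_labels = [1, 1, 0, 0]

# Broken: the vectorizer step is the class, not an instance.
try:
    broken = Pipeline([
        ("vectorizer", CountVectorizer),
        ("model", LogisticRegression()),
    ])
    broken.fit(train_messages, train_labels)
    fit_failed = False
except Exception as exc:
    fit_failed = True
    print(f"Broken pipeline failed with {type(exc).__name__}")

# Fixed: CountVectorizer() creates the instance the pipeline needs.
fixed = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("model", LogisticRegression()),
])
fixed.fit(train_messages, train_labels)
print(fit_failed, hasattr(fixed.named_steps["model"], "coef_"))
```

After fitting, the model inside the fixed pipeline has learned coefficients, which confirms that training ran.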
5. You want to improve your spam detection pipeline by adding a step to remove common stop words before vectorizing. Which pipeline modification correctly adds this step using CountVectorizer with stop words removal?
hard
A. Pipeline([('stopwords', StopWordsRemover()), ('vectorizer', CountVectorizer()), ('model', LogisticRegression())])
B. Pipeline([('vectorizer', CountVectorizer()), ('stopwords', StopWordsRemover()), ('model', LogisticRegression())])
C. Pipeline([('vectorizer', CountVectorizer(stop_words='english')), ('model', LogisticRegression())])
D. Pipeline([('vectorizer', CountVectorizer(stop_words=None)), ('model', LogisticRegression())])
Solution
Step 1: Understand how to remove stop words in CountVectorizer
CountVectorizer has a parameter stop_words which can be set to 'english' to remove common English stop words automatically.
Step 2: Check pipeline options for correct usage
Pipeline([('vectorizer', CountVectorizer(stop_words='english')), ('model', LogisticRegression())]) correctly sets stop_words='english' inside CountVectorizer. Other options either use a non-existent StopWordsRemover step or set stop_words=None, which disables removal.
Final Answer:
Pipeline([('vectorizer', CountVectorizer(stop_words='english')), ('model', LogisticRegression())]) -> Option C
Quick Check:
Use stop_words='english' in CountVectorizer to remove stop words [OK]
Hint: Use stop_words='english' inside CountVectorizer [OK]
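The effect of stop_words='english' can be seen by comparing the vocabulary learned with and without it. The two short documents below are made up for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the prize is free", "the meeting is at noon"]

# Default settings: every token of two or more characters enters the vocabulary.
plain = CountVectorizer().fit(docs)

# With stop_words='english', words on the built-in English stop list are dropped.
filtered = CountVectorizer(stop_words="english").fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
print("the" in plain.vocabulary_, "the" in filtered.vocabulary_)  # True False
```

Because the removal happens inside the vectorizer, the pipeline keeps its two steps and no separate stop-word step is needed.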