
Spam detection pipeline in NLP

Introduction

A spam detection pipeline automatically decides whether a message is spam or not. This helps keep an inbox clean and safe.

When you want to filter unwanted emails from your inbox.
When building a chat app that blocks spam messages.
When sorting customer feedback into useful and spam categories.
When creating a system to detect fake reviews or comments.
Syntax
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

The pipeline chains two steps: the vectorizer turns each message into word counts, and the classifier is then trained on those counts.

Each step is a (name, estimator) pair, so you can look up or reconfigure any step by its name.
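As a minimal sketch of how the step names can be used, the snippet below looks up a step through `named_steps` and changes one of its settings through `set_params`. The choice of `stop_words` as the parameter to change is only an illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# List the step names in the order they run
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Look up a step by its name
vectorizer = pipeline.named_steps['vectorizer']

# Change a setting of one step using the '<step name>__<parameter>' form
pipeline.set_params(vectorizer__stop_words='english')
print(vectorizer.stop_words)
```

Because settings are addressed by step name, a step can be tuned without rebuilding the whole pipeline.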

Examples
This example removes common English words (stop words such as "the" and "you") before training.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
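To see what the stop-word filter actually drops, a small sketch like the following compares the vocabulary learned with and without `stop_words='english'`. The two sample sentences are taken from the sample data used later on this page; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = [
    'Congratulations, you won a prize',
    'Hello friend, how are you?'
]

# Vocabulary learned with the default settings
default_vocab = set(CountVectorizer().fit(sample).vocabulary_)

# Vocabulary learned after removing scikit-learn's built-in English stop words
filtered_vocab = set(CountVectorizer(stop_words='english').fit(sample).vocabulary_)

# Words that the stop-word filter removed
removed = sorted(default_vocab - filtered_vocab)
print("Removed by the stop-word filter:", removed)
print("Kept:", sorted(filtered_vocab))
```

Note that the default tokenizer already ignores one-letter words such as "a", so they never appear in either vocabulary.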
This example counts single words and pairs of adjacent words (unigrams and bigrams), which captures some word order, such as "win money".
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('classifier', MultinomialNB())
])
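A quick way to check what `ngram_range=(1,2)` produces is to fit the vectorizer on a single message and list the features it learned. This is a small sketch using one message from the sample data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn unigrams and bigrams from a single message
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['Win money now'])

# vocabulary_ maps each feature to its column index; sort the features by name
features = sorted(vectorizer.vocabulary_)
print(features)
```

For this message the features are the three single words plus the two adjacent pairs, all lowercased.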
Sample Model

This program trains a spam detector on a small set of messages, then checks its predictions on messages held out for testing. With only eight messages the scores are illustrative, not a reliable measure of quality.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Sample data: messages and labels (spam=1, not spam=0)
messages = [
    'Win money now',
    'Hello friend, how are you?',
    'Cheap meds available',
    'Are we meeting today?',
    'Congratulations, you won a prize',
    'Can we have a call tomorrow?',
    'Get rich quick scheme',
    'Lunch at noon?'
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

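# Split data into training and testing sets.
# stratify=labels keeps one spam and one non-spam message in the 2-message test set.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)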

# Create the spam detection pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on test data
predictions = pipeline.predict(X_test)

# Print accuracy and detailed report
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("Classification Report:")
print(classification_report(y_test, predictions))
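Once fitted, the pipeline accepts raw strings directly, because the vectorizer step is applied automatically before the classifier. The sketch below trains on all eight sample messages and classifies two new ones; the new messages are made up for illustration and are not part of the original data.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same messages and labels as the sample above (spam=1, not spam=0)
messages = [
    'Win money now',
    'Hello friend, how are you?',
    'Cheap meds available',
    'Are we meeting today?',
    'Congratulations, you won a prize',
    'Can we have a call tomorrow?',
    'Get rich quick scheme',
    'Lunch at noon?'
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
pipeline.fit(messages, labels)

# Hypothetical unseen messages, invented for this example
new_messages = ['Win a cheap prize now', 'Are we meeting for lunch?']

predictions = pipeline.predict(new_messages)          # predicted labels
probabilities = pipeline.predict_proba(new_messages)  # columns follow classes_ (0, then 1)

for text, label, probs in zip(new_messages, predictions, probabilities):
    print(f"{text!r} -> label {label}, P(spam) = {probs[1]:.2f}")
```

Words the vectorizer never saw during training, such as "for" here, are simply ignored at prediction time.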
Important Notes

Using a pipeline helps keep your code clean and easy to update.

CountVectorizer turns each message into a row of word counts, one column per word in the vocabulary, which the model can work with.
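As a small sketch of what those numbers look like, the snippet below vectorizes two short messages and prints the vocabulary and the resulting count matrix. The two messages are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical messages, invented for this example
texts = ['win money now', 'win win prize']

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(texts)  # sparse matrix: one row per message

# Column order follows the alphabetically sorted vocabulary
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = matrix.toarray().tolist()

print(columns)
for text, row in zip(texts, counts):
    print(f"{text!r} -> {row}")
```

Each row has one entry per vocabulary word, so the second message gets a 2 in the "win" column because the word appears twice.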

Multinomial Naive Bayes is a simple, fast model that works well on word-count features, which makes it a common starting point for text classification.

Summary

A spam detection pipeline turns text into numbers and then trains a model to spot spam.

It is useful for filtering unwanted messages automatically.

Using pipelines makes your machine learning code easier to build and maintain.