Spam detection pipeline in NLP
Introduction
We want to determine automatically whether a message is spam. This helps keep an inbox clean and safe.
Typical use cases:
Filtering unwanted emails from an inbox.
Building a chat app that blocks spam messages.
Sorting customer feedback into useful and spam categories.
Creating a system to detect fake reviews or comments.
Syntax
NLP
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
The pipeline chains steps: first it turns text into numbers, then it trains a model on those numbers.
Each step is a (name, estimator) pair, so individual steps are easy to inspect, configure, or swap out.
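To make the role of the step names concrete, here is a minimal sketch of how they can be used to reach into a pipeline. The parameter values (`alpha=0.5`, `lowercase=True`) are arbitrary choices for illustration, not recommendations from this tutorial.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])

# The names given to each step, in execution order.
step_names = [name for name, _ in pipeline.steps]

# A step can be looked up by its name.
vectorizer = pipeline.named_steps['vectorizer']

# Step parameters are addressed as '<step name>__<parameter name>'.
# The values below are illustrative assumptions only.
pipeline.set_params(vectorizer__lowercase=True, classifier__alpha=0.5)
alpha = pipeline.named_steps['classifier'].alpha

print(step_names)
print(alpha)
```

The double-underscore naming is also what tools such as grid search rely on when tuning a pipeline, which is one practical reason to pick short, descriptive step names.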
Examples
This example removes common English words before training.
NLP
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
This example uses single words and pairs of adjacent words as features, so short phrases are captured as well as individual words.
NLP
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])
Sample Model
This program trains a spam detector on a small set of messages. It then tests how well it can tell spam from normal messages.
NLP
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Sample data: messages and labels (spam=1, not spam=0)
messages = [
    'Win money now',
    'Hello friend, how are you?',
    'Cheap meds available',
    'Are we meeting today?',
    'Congratulations, you won a prize',
    'Can we have a call tomorrow?',
    'Get rich quick scheme',
    'Lunch at noon?'
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42
)

# Create the spam detection pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on test data
predictions = pipeline.predict(X_test)

# Print accuracy and detailed report
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("Classification Report:")
print(classification_report(y_test, predictions))
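Once trained, the same pipeline object can score messages it has never seen, because the vectorizer and classifier are applied together. The sketch below is one way to do that. It trains on all eight sample messages rather than a split, and the two new messages are made up for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same sample data as in the program above (spam=1, not spam=0)
messages = [
    'Win money now',
    'Hello friend, how are you?',
    'Cheap meds available',
    'Are we meeting today?',
    'Congratulations, you won a prize',
    'Can we have a call tomorrow?',
    'Get rich quick scheme',
    'Lunch at noon?',
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
pipeline.fit(messages, labels)

# Hypothetical unseen messages, invented for this sketch
new_messages = ['Win a cheap prize now', 'Are we having lunch today?']

# Raw text goes in; the pipeline vectorizes it before classifying
predictions = pipeline.predict(new_messages)
probabilities = pipeline.predict_proba(new_messages)

# Find which probability column corresponds to the spam label (1)
classes = list(pipeline.named_steps['classifier'].classes_)
spam_column = classes.index(1)

for text, label, probs in zip(new_messages, predictions, probabilities):
    print(f"{text!r}: label={label}, P(spam)={probs[spam_column]:.2f}")
```

With a dataset this small the probabilities are only a rough indication; a real filter would need far more training messages.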
Important Notes
Using a pipeline helps keep your code clean and easy to update.
CountVectorizer turns each message into a vector of word counts, a numeric form that the model can work with.
Multinomial Naive Bayes is a simple but effective model for text classification.
Summary
A spam detection pipeline turns text into numbers and then trains a model to spot spam.
It is useful for filtering unwanted messages automatically.
Using pipelines makes your machine learning code easier to build and maintain.