MLOps Program · Beginner · 2 min read

Python Sklearn Program to Classify Spam Email

Use CountVectorizer to convert emails to numbers, then train a MultinomialNB model with fit(), and predict spam with predict() in sklearn.
📋

Examples

Input: Email: 'Win a free prize now', Label: spam
Output: Prediction: spam
Input: Email: 'Meeting at 10am tomorrow', Label: not spam
Output: Prediction: not spam
Input: Email: '', Label: not spam
Output: Prediction: not spam
🧠

How to Think About It

To classify spam emails, first convert the text emails into numbers using a tool like CountVectorizer. Then, train a simple model like MultinomialNB on labeled emails (spam or not spam). Finally, use the trained model to predict if new emails are spam or not.
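These three steps can be bundled into a single object with sklearn's Pipeline helper. The sketch below uses toy data for illustration; the email texts here are just examples, not a real dataset.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Bundle the two steps (vectorize, then classify) into one pipeline
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())

# Train on a few labeled emails (toy data)
emails = ['Win a free prize now', 'Meeting at 10am tomorrow']
labels = ['spam', 'not spam']
spam_clf.fit(emails, labels)

# Predict directly on raw text; the pipeline vectorizes internally
print(spam_clf.predict(['Free prize inside']))  # predicts 'spam'
```

The pipeline lets you call fit() and predict() on raw strings, so you cannot forget the vectorization step.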
📐

Algorithm

1. Collect labeled email data with spam and not spam labels.
2. Convert email text into numeric features using text vectorization.
3. Train a Naive Bayes classifier on the numeric data and labels.
4. Use the trained model to predict spam or not spam on new emails.
5. Evaluate the model's accuracy by comparing predictions to true labels.
💻

Code

sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
emails = ['Win a free prize now', 'Meeting at 10am tomorrow', 'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

# Convert text to numbers
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
Output
Accuracy: 1.0
Predictions: ['not spam' 'spam']
🔍

Dry Run

Let's trace the example emails through the code to see how spam classification works.

1. Convert emails to numbers: Emails like 'Win a free prize now' become word counts such as {'win': 1, 'free': 1, 'prize': 1, 'now': 1}.

2. Split data: Two emails go to training, two to testing, keeping labels aligned.

3. Train model: The model learns which words appear more often in spam than in not-spam emails.

4. Predict on test emails: The model predicts labels for the test emails based on the learned word patterns.

5. Calculate accuracy: Predicted labels are compared to the true labels to get the accuracy score.

Email | Label | Vectorized Features | Predicted Label
Meeting at 10am tomorrow | not spam | {'meeting': 1, 'at': 1, '10am': 1, 'tomorrow': 1} | not spam
Cheap meds available | spam | {'cheap': 1, 'meds': 1, 'available': 1} | spam
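You can inspect step 1 of the dry run directly. The sketch below vectorizes a single email and prints the learned vocabulary and counts; note that CountVectorizer's default tokenizer drops single-character words like 'a'.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(['Win a free prize now'])

# The learned vocabulary maps each word to a column index
# (indices are assigned alphabetically)
print(vectorizer.vocabulary_)   # e.g. {'win': 3, 'free': 0, 'prize': 2, 'now': 1}

# Each row of X holds the word counts for one email
print(X.toarray())              # [[1 1 1 1]]
```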
💡

Why This Works

Step 1: Text to numbers

The CountVectorizer changes words into numbers so the model can understand text.

Step 2: Training the model

The MultinomialNB learns which words are common in spam or not spam emails.

Step 3: Making predictions

The model uses learned word patterns to guess if new emails are spam or not.

🔄

Alternative Approaches

Use TfidfVectorizer with Logistic Regression
sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

emails = ['Win a free prize now', 'Meeting at 10am tomorrow', 'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
TfidfVectorizer weighs words by importance; Logistic Regression can be more accurate but slower to train.
Use CountVectorizer with Support Vector Machine (SVM)
sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

emails = ['Win a free prize now', 'Meeting at 10am tomorrow', 'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

model = LinearSVC()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
SVM often performs well on text classification but can be slower and needs tuning.

Complexity: O(n*m) time, O(n*m) space

Time Complexity

Vectorizing the text takes time proportional to the total number of words across all emails. Training Naive Bayes is linear in the number of samples (n) and features (m, the vocabulary size), giving O(n*m) overall.

Space Complexity

Storing the vectorized emails requires space proportional to n*m in the dense view, where m is the vocabulary size. In practice, CountVectorizer returns a sparse matrix, so memory scales with the number of nonzero word counts.
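You can check both dimensions on the article's four toy emails. The sketch below prints the logical shape (n emails by m vocabulary words) and the number of entries actually stored in the sparse matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']

X = CountVectorizer().fit_transform(emails)

# X is a scipy sparse matrix: logically n x m, but only nonzero
# word counts take up memory
print(X.shape)   # (4, 14): n=4 emails, m=14 vocabulary words
print(X.nnz)     # 14 nonzero entries, far fewer than 4*14=56 cells
```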

Which Approach is Fastest?

Naive Bayes with CountVectorizer is fastest and simplest; Logistic Regression and SVM may give better accuracy but take more time.

Approach | Time | Space | Best For
Naive Bayes + CountVectorizer | O(n*m) | O(n*m) | Fast training, simple spam detection
Logistic Regression + TfidfVectorizer | O(n*m) | O(n*m) | Better accuracy, slower training
SVM + CountVectorizer | O(n*m) | O(n*m) | High accuracy, needs tuning
💡
Always split your data into training and testing sets to check how well your spam classifier works on new emails.
⚠️
Beginners often forget to convert text emails into numeric features before training the model.
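A quick way to see this mistake in action: fitting MultinomialNB directly on raw strings raises a ValueError, because the model expects numeric features (a minimal sketch with toy data):

```python
from sklearn.naive_bayes import MultinomialNB

emails = ['Win a free prize now', 'Meeting at 10am tomorrow']
labels = ['spam', 'not spam']

try:
    # Passing raw strings without vectorizing first fails
    MultinomialNB().fit(emails, labels)
except ValueError as e:
    print('Fit failed:', e)
```

Always run the emails through a vectorizer (or wrap both steps in a Pipeline) before calling fit().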