Python Sklearn Program to Classify Spam Email
Use CountVectorizer to convert emails into numeric word counts, train a MultinomialNB model with fit(), and classify new emails with predict() in sklearn.
How to Think About It
First, convert each email's text into numeric word counts with CountVectorizer. Then, train a simple model like MultinomialNB on labeled emails (spam or not spam). Finally, use the trained model to predict whether new emails are spam.
Code
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

# Convert text to numbers
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
```
Dry Run
Let's trace the example emails through the code to see how spam classification works.
Convert emails to numbers
Emails like 'Win a free prize now' become counts of words like {'win':1, 'free':1, 'prize':1, 'now':1}.
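This vectorization step can be sketched directly on the sample data. Note that CountVectorizer's default tokenizer lowercases text and drops single-character tokens, so 'Win' becomes 'win' and the word 'a' disappears:

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)  # sparse matrix: one row per email

# Recover the word counts for the first email as a dict
row = X[0].toarray()[0]
counts = {word: int(row[idx])
          for word, idx in vectorizer.vocabulary_.items()
          if row[idx] > 0}
print(counts)  # {'win': 1, 'free': 1, 'prize': 1, 'now': 1}
```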
Split data
Two emails go to training, two to testing, keeping labels aligned.
Train model
Model learns which words appear more in spam vs not spam.
Predict on test emails
Model predicts labels for test emails based on learned word patterns.
Calculate accuracy
Compare predicted labels to true labels to get accuracy score.
| Email | True Label | Vectorized Features | Predicted Label |
|---|---|---|---|
| Meeting at 10am tomorrow | not spam | {'meeting':1, 'at':1, '10am':1, 'tomorrow':1} | not spam |
| Cheap meds available | spam | {'cheap':1, 'meds':1, 'available':1} | spam |
Why This Works
Step 1: Text to numbers
The CountVectorizer changes words into numbers so the model can understand text.
Step 2: Training the model
The MultinomialNB learns which words are common in spam or not spam emails.
Step 3: Making predictions
The model uses learned word patterns to guess if new emails are spam or not.
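To classify an email that was never seen during training, transform it with the already-fitted vectorizer (transform, not fit_transform, so the training vocabulary is reused). The phrase 'free meds prize' below is a made-up test input, and the model is trained on all four sample emails for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# transform (not fit_transform) maps the new text onto the training vocabulary
new_email = vectorizer.transform(['free meds prize'])
print(model.predict(new_email))  # ['spam']
```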
Alternative Approaches
Using TfidfVectorizer with LogisticRegression:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
```
Using CountVectorizer with a linear SVM:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42)

model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
```
Complexity: O(n*m) time, O(n*m) space
Time Complexity
Vectorizing the text takes time proportional to the total number of words across all emails. Training Naive Bayes is linear in the number of samples (n) times the number of features, i.e. the vocabulary size (m).
Space Complexity
The vectorized emails are stored as a sparse matrix whose worst-case size is proportional to n*m, where n is the number of emails and m is the vocabulary size; in practice only the non-zero word counts are stored.
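This is easy to check on the article's four sample emails: the matrix has one row per email and one column per vocabulary word, but only the non-zero counts are actually stored:

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
X = CountVectorizer().fit_transform(emails)

print(X.shape)  # (number of emails, vocabulary size)
print(X.nnz)    # non-zero entries actually stored
```

Here the dense representation would hold 4 * 14 = 56 counts, yet only 14 non-zeros are kept, which is why sparse storage scales well to real email corpora.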
Which Approach is Fastest?
Naive Bayes with CountVectorizer is fastest and simplest; Logistic Regression and SVM may give better accuracy but take more time.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Naive Bayes + CountVectorizer | O(n*m) | O(n*m) | Fast training, simple spam detection |
| Logistic Regression + TfidfVectorizer | O(n*m) | O(n*m) | Better accuracy, slower training |
| SVM + CountVectorizer | O(n*m) | O(n*m) | High accuracy, needs tuning |