Python Sklearn Program to Classify Spam Email
Use CountVectorizer to convert emails into numeric word counts, train a MultinomialNB model with fit(), and classify new emails with predict() in sklearn.
How to Think About It
First, convert each email's text into numeric word counts with CountVectorizer. Then, train a simple model like MultinomialNB on labeled emails (spam or not spam). Finally, use the trained model to predict whether new emails are spam.
Code
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

# Convert text to numbers
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
```
Dry Run
Let's trace the example emails through the code to see how spam classification works.
Convert emails to numbers
Emails like 'Win a free prize now' become counts of words like {'win':1, 'free':1, 'prize':1, 'now':1}.
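This vectorization step can be sketched directly on the sample data. Note that CountVectorizer's default tokenizer lowercases text and drops single-character tokens, so 'Win' becomes 'win' and the word 'a' disappears:

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)  # sparse matrix: one row per email

# Recover the word counts for the first email as a dict
row = X[0].toarray()[0]
counts = {word: int(row[idx])
          for word, idx in vectorizer.vocabulary_.items()
          if row[idx] > 0}
print(counts)  # {'win': 1, 'free': 1, 'prize': 1, 'now': 1}
```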
Split data
Two emails go to training, two to testing, keeping labels aligned.
Train model
Model learns which words appear more in spam vs not spam.
Predict on test emails
Model predicts labels for test emails based on learned word patterns.
Calculate accuracy
Compare predicted labels to true labels to get accuracy score.
| Email | True Label | Vectorized Features | Predicted Label |
|---|---|---|---|
| Meeting at 10am tomorrow | not spam | {'meeting':1, 'at':1, '10am':1, 'tomorrow':1} | not spam |
| Cheap meds available | spam | {'cheap':1, 'meds':1, 'available':1} | spam |
Why This Works
Step 1: Text to numbers
The CountVectorizer changes words into numbers so the model can understand text.
Step 2: Training the model
The MultinomialNB learns which words are common in spam or not spam emails.
Step 3: Making predictions
The model uses learned word patterns to guess if new emails are spam or not.
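To classify an email that was never seen during training, transform it with the already-fitted vectorizer (transform, not fit_transform, so the training vocabulary is reused). The phrase 'free meds prize' below is a made-up test input, and the model is trained on all four sample emails for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# transform (not fit_transform) maps the new text onto the training vocabulary
new_email = vectorizer.transform(['free meds prize'])
print(model.predict(new_email))  # ['spam']
```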
Alternative Approaches
Using TfidfVectorizer with LogisticRegression:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
```
Using CountVectorizer with a linear SVM:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
labels = ['spam', 'not spam', 'spam', 'not spam']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42)

model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Predictions:', y_pred)
```
Complexity: O(n*m) time, O(n*m) space
Time Complexity
Vectorizing the text takes time proportional to the total number of words across all emails. Training Naive Bayes is linear in the number of samples (n) times the number of features, i.e. the vocabulary size (m).
Space Complexity
The vectorized emails are stored as a sparse matrix whose worst-case size is proportional to n*m, where n is the number of emails and m is the vocabulary size; in practice only the non-zero word counts are stored.
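This is easy to check on the article's four sample emails: the matrix has one row per email and one column per vocabulary word, but only the non-zero counts are actually stored:

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ['Win a free prize now', 'Meeting at 10am tomorrow',
          'Cheap meds available', 'Project deadline extended']
X = CountVectorizer().fit_transform(emails)

print(X.shape)  # (number of emails, vocabulary size)
print(X.nnz)    # non-zero entries actually stored
```

Here the dense representation would hold 4 * 14 = 56 counts, yet only 14 non-zeros are kept, which is why sparse storage scales well to real email corpora.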
Which Approach is Fastest?
Naive Bayes with CountVectorizer is fastest and simplest; Logistic Regression and SVM may give better accuracy but take more time.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Naive Bayes + CountVectorizer | O(n*m) | O(n*m) | Fast training, simple spam detection |
| Logistic Regression + TfidfVectorizer | O(n*m) | O(n*m) | Better accuracy, slower training |
| SVM + CountVectorizer | O(n*m) | O(n*m) | High accuracy, needs tuning |