NLP Program to Classify Spam Messages
Use CountVectorizer to convert text to numbers and MultinomialNB to train a spam classifier: fit with model.fit(X_train, y_train), then predict with model.predict(X_test).
Examples
How to Think About It
Algorithm
Code
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Toy dataset: five messages and their labels
texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# Convert each message into a row of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out 40% of the messages (2 of 5) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.4, random_state=42)

# Train the Naive Bayes classifier and predict the held-out labels
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('Predictions:', predictions.tolist())
```
Dry Run
Let's trace the example 'Free money now!!!' and 'Hi, are we meeting today?' through the code.
Convert texts to numbers
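Texts become a matrix with one row per message and one column per vocabulary word; each cell holds a word count. CountVectorizer lowercases the text, drops punctuation, and by default ignores one-letter tokens such as 'a', so 'Free money now!!!' becomes 'free':1, 'money':1, 'now':1.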
Split data
Training set gets 3 messages, test set gets 2 messages.
Train model
Model learns word patterns linked to spam or not spam from training data.
Predict test labels
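The model predicts one label for each of the two held-out messages. With random_state=42 the held-out messages are 'Hi, are we meeting today?' and 'Hello friend'; 'Congratulations, you won a prize!' lands in the training set. None of the words in the held-out messages appear in the three training messages, so the predictions rest only on class priors and smoothing: the model labels 'Hi, are we meeting today?' as 'not spam' but 'Hello friend' as 'spam'. A dataset this small shows the mechanics, not real accuracy.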
| Message | Word Counts | Label |
|---|---|---|
| Free money now!!! | {free:1, money:1, now:1} | spam |
| Hi, are we meeting today? | {hi:1, are:1, we:1, meeting:1, today:1} | not spam |
| Congratulations, you won a prize! | {congratulations:1, you:1, won:1, prize:1} | spam |
| Call me now | {call:1, me:1, now:1} | not spam |
| Hello friend | {hello:1, friend:1} | not spam |
Why This Works
Step 1: Text to numbers
CountVectorizer turns each message into a vector of word counts, which gives the model numeric features to learn from.
Step 2: Training the model
MultinomialNB estimates how often each word appears in spam and in non-spam messages.
Step 3: Making predictions
The model uses learned word patterns to guess if new messages are spam or not.
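One detail matters when classifying new messages: they must be converted with the vocabulary learned during training (`transform`, not a fresh `fit_transform`). A sketch of one way to handle this is scikit-learn's `make_pipeline`, which chains the vectorizer and the classifier. This swaps in a pipeline for the two-step code used above; the messages in `new_messages` are made up for illustration, and the model is fit on all five messages for simplicity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# The pipeline fits the vectorizer and the classifier together,
# then reuses the fitted vocabulary whenever predict() is called.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)

new_messages = ['Win free money', 'Are we meeting today']  # hypothetical inputs
predicted = spam_filter.predict(new_messages)
probabilities = spam_filter.predict_proba(new_messages)

classes = list(spam_filter.classes_)
for message, label, probs in zip(new_messages, predicted, probabilities):
    p_spam = probs[classes.index('spam')]
    print(f'{message!r}: {label} (P(spam) = {p_spam:.2f})')
```

Words the vectorizer never saw during training, such as 'win' here, are simply ignored at prediction time.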
Alternative Approaches
```python
# Alternative 1: TF-IDF features with logistic regression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# Weight each word by how distinctive it is across messages
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.4, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('Predictions:', predictions.tolist())
```
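```python
# Alternative 2: rule-based keyword filter (no training)
def classify_spam(text):
    # Flag a message as spam if it contains any keyword.
    # This is a substring check, so 'win' also matches 'window'.
    spam_words = ['free', 'win', 'prize', 'money']
    text_lower = text.lower()
    for word in spam_words:
        if word in text_lower:
            return 'spam'
    return 'not spam'


print(classify_spam('Free money now!!!'))  # spam
print(classify_spam('Hello friend'))       # not spam
```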
Complexity: O(n*m) time, O(n*m) space
Time Complexity
Converting n messages with m unique words to vectors takes O(n*m). Training Naive Bayes is also O(n*m) since it counts word frequencies.
Space Complexity
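Storing the word count matrix takes O(n*m) space as a dense upper bound, where n is the number of messages and m is the vocabulary size. CountVectorizer returns a sparse matrix, so actual memory grows with the number of non-zero counts.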
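As a quick illustration of the space bound, the sketch below compares the number of cells in a dense n-by-m matrix with the number of entries the sparse matrix actually stores for the five example messages.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']

X = CountVectorizer().fit_transform(texts)  # SciPy sparse matrix
n, m = X.shape

print('messages n:', n)
print('vocabulary size m:', m)
print('dense cells n*m:', n * m)
print('stored non-zero counts:', X.nnz)
```

Each message uses only a handful of the vocabulary words, so most cells are zero; the gap widens as the vocabulary grows.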
Which Approach is Fastest?
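Rule-based filtering is fastest but least accurate; Naive Bayes balances speed and accuracy; Logistic Regression trains more slowly but can improve accuracy. In the table, n is the number of messages, m is the vocabulary size, and k is the number of keywords in the rule list.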
| Approach | Time | Space | Best For |
|---|---|---|---|
| Naive Bayes with CountVectorizer | O(n*m) | O(n*m) | Balanced speed and accuracy |
| Logistic Regression with TF-IDF | O(n*m) per iteration | O(n*m) | Higher accuracy, slower training |
| Rule-based keyword filter | O(n*k) | O(k) | Very fast, simple, low accuracy |
