
NLP Program to Classify Spam Messages

Use Python with scikit-learn: CountVectorizer converts text into word-count vectors, and MultinomialNB learns a spam classifier from them. Train with model.fit(X_train, y_train) and predict with model.predict(X_test).

📋

Examples

Input: Free money now!!!

Output: spam

Input: Hi, are we meeting today?

Output: not spam

Input: Congratulations, you won a prize!

Output: spam

🧠

How to Think About It

To classify spam, first convert text messages into numbers that a computer can understand using a method like counting words. Then, train a simple model that learns patterns from labeled examples of spam and not spam messages. Finally, use this model to predict if new messages are spam or not.
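As a rough illustration of "counting words", here is a minimal sketch in plain Python. The helper name word_counts is made up for this example; the regular expression mirrors CountVectorizer's default rule of keeping tokens with two or more characters.

```python
import re
from collections import Counter

def word_counts(message):
    """Lowercase a message and count its words (tokens of 2+ characters)."""
    tokens = re.findall(r"\b\w\w+\b", message.lower())
    return Counter(tokens)

counts = word_counts("Free money now!!!")
print(dict(counts))  # {'free': 1, 'money': 1, 'now': 1}
```

Each message becomes a small table of numbers like this one, and those numbers are what the model learns from.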

📐

Algorithm

Collect labeled messages as spam or not spam.

Convert messages into numerical features using word counts.

Split data into training and testing sets.

Train a Naive Bayes classifier on the training data.

Use the trained model to predict labels on test messages.

Evaluate the model's accuracy.

💻

Code

python

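from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Tiny labeled dataset: one label per message
texts = ['Free money now!!!', 'Hi, are we meeting today?', 'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# Turn each message into a row of word counts.
# Fitting on all texts keeps the demo short; in a real project,
# fit the vectorizer on the training split only.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out 40% of the messages (2 of 5) for testing
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train Naive Bayes on the training rows, then label the test rows
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print('Predictions:', predictions.tolist())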

Output

Predictions: ['spam', 'not spam']
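The algorithm lists an evaluation step that the code above stops short of. One way to do it is scikit-learn's accuracy_score, sketched here on hypothetical labels and predictions (the two lists are invented for illustration and are not results from the code above).

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for two test messages
y_true = ['not spam', 'spam']
y_pred = ['not spam', 'not spam']

# Accuracy is the fraction of predictions that match the true labels
accuracy = accuracy_score(y_true, y_pred)
print('Accuracy:', accuracy)  # Accuracy: 0.5
```

In the program above, the equivalent call would be accuracy_score(y_test, predictions).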

🔍

Dry Run

Let's trace the examples 'Free money now!!!' and 'Hi, are we meeting today?' through the code.

Convert texts to numbers

CountVectorizer lowercases each message and strips punctuation, then builds a matrix with one row per message and one column per vocabulary word. For 'Free money now!!!', the nonzero counts are 'free':1, 'money':1, 'now':1.

Split data

Training set gets 3 messages, test set gets 2 messages.

Train model

Model learns word patterns linked to spam or not spam from training data.

Predict test labels

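Model predicts one label for each of the two held-out test messages, using only the word patterns it saw in the three training messages. With so little training data, the labels depend heavily on which messages the split holds out.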

Message	Word Counts	Label
Free money now!!!	{free:1, money:1, now:1}	spam
Hi, are we meeting today?	{hi:1, are:1, we:1, meeting:1, today:1}	not spam
Congratulations, you won a prize!	{congratulations:1, you:1, won:1, prize:1}	spam
Call me now	{call:1, me:1, now:1}	not spam
Hello friend	{hello:1, friend:1}	not spam

By default, CountVectorizer ignores one-letter tokens, so 'a' does not appear in the counts.
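To see the actual matrix rather than a hand-written table, you can inspect the fitted vectorizer. This is a small sketch on two of the messages; it reads the vocabulary_ attribute, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['Free money now!!!', 'Congratulations, you won a prize!']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# List the words in column order (columns are sorted alphabetically)
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
# ['congratulations', 'free', 'money', 'now', 'prize', 'won', 'you']

# One row per message, one column per word
print(X.toarray())
```

The first row has a 1 under 'free', 'money', and 'now', and a 0 everywhere else.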

💡

Why This Works

Step 1: Text to numbers

CountVectorizer builds a vocabulary from the messages and represents each message as a vector of word counts, which is a form the model can process.

Step 2: Training the model

MultinomialNB estimates how likely each word is to appear in spam messages and in not-spam messages.

Step 3: Making predictions

The model combines these learned word probabilities to pick the more likely label for a new message.
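To make the three steps concrete, here is a simplified Naive Bayes written by hand in plain Python. It is a teaching sketch, not scikit-learn's implementation: the function names tokenize, train, and predict are made up for this example, and it uses add-one smoothing so that unseen word and label pairs never get a probability of zero.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep tokens of 2+ characters
    return re.findall(r"\b\w\w+\b", text.lower())

def train(texts, labels):
    """Count words per label and messages per label."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def predict(text, word_counts, class_counts, vocabulary):
    """Return the label with the highest log-probability."""
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from how common the label is in the training data
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are skipped
            # Add-one smoothing: pretend every word was seen once more
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

model = train(texts, labels)
print(predict('Free prize money', *model))           # spam
print(predict('Hi, are we meeting today?', *model))  # not spam
```

Words such as 'free' and 'prize' appear only in the spam messages, so they pull the score toward 'spam'.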

🔄

Alternative Approaches

Use TF-IDF Vectorizer with Logistic Regression

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ['Free money now!!!', 'Hi, are we meeting today?', 'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# Weight each word by how rare it is across messages, not just by its count
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

# Logistic Regression learns one weight per word
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print('Predictions:', predictions.tolist())

TF-IDF gives less weight to words that appear in many messages, so distinctive words count for more. Logistic Regression can be more accurate than Naive Bayes but is slower to train.
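The "IDF" part can be worked out by hand. The sketch below applies the smoothed formula that TfidfVectorizer uses by default, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of messages and df is the number of messages containing the word. The tokenize helper is made up for this example.

```python
import math
import re

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

n_docs = len(texts)

# Document frequency: in how many messages does each word appear?
doc_freq = {}
for text in texts:
    for word in set(tokenize(text)):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

print(round(idf['now'], 3))   # 1.693 ('now' appears in 2 messages)
print(round(idf['free'], 3))  # 2.099 ('free' appears in 1 message)
```

'now' appears in both a spam and a not-spam message, so it gets a lower weight than 'free'. TfidfVectorizer also multiplies by the word count and normalizes each row, which this sketch leaves out.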

Use a simple rule-based keyword filter

python

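import re

def classify_spam(text):
    # Hand-picked keywords that mark a message as spam
    spam_words = {'free', 'win', 'prize', 'money'}
    # Compare whole words so 'win' does not match 'winter' or 'twin'
    words = set(re.findall(r'[a-z]+', text.lower()))
    return 'spam' if words & spam_words else 'not spam'

print(classify_spam('Free money now!!!'))  # spam
print(classify_spam('Hello friend'))       # not spam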

Very fast and simple but less accurate and cannot learn from data.

⚡

Complexity: O(n*m) time, O(n*m) space

Time Complexity

Converting n messages with m unique words to vectors takes O(n*m). Training Naive Bayes is also O(n*m) since it counts word frequencies.

Space Complexity

Storing the word count matrix requires O(n*m) space, where n is messages and m is vocabulary size.
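O(n*m) is the worst case. CountVectorizer returns a SciPy sparse matrix, which stores only the nonzero counts, so real memory use is usually much lower. A rough sketch of the gap on the five tutorial messages, in plain Python with a made-up tokenize helper:

```python
import re

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Unique words in each message
rows = [set(tokenize(text)) for text in texts]
vocabulary = set().union(*rows)

dense_cells = len(texts) * len(vocabulary)     # n * m cells in a full table
nonzero_cells = sum(len(row) for row in rows)  # cells that hold a count

print(dense_cells, nonzero_cells)  # 80 17
```

Only 17 of the 80 cells hold a count, and the gap widens as the vocabulary grows.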

Which Approach is Fastest?

Rule-based filtering is the fastest but least accurate; with k keywords it takes O(n*k) time. Naive Bayes balances speed and accuracy. Logistic Regression trains more slowly but can improve accuracy.

Approach	Time	Space	Best For
Naive Bayes with CountVectorizer	O(n*m)	O(n*m)	Balanced speed and accuracy
Logistic Regression with TF-IDF	O(n*m)	O(n*m)	Higher accuracy, slower training
Rule-based keyword filter	O(n*k)	O(k)	Very fast, simple, low accuracy

💡

Always clean and preprocess text before training for better spam detection.
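What "cleaning" means depends on your data. As one possible sketch, the hypothetical clean_text helper below lowercases the message, replaces URLs and numbers with placeholder words, and drops punctuation. The placeholder names 'url' and 'number' are choices made for this example.

```python
import re

def clean_text(text):
    """Lowercase, replace URLs and numbers with placeholders, drop punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " url ", text)  # any link becomes 'url'
    text = re.sub(r"\d+", " number ", text)        # any number becomes 'number'
    text = re.sub(r"[^a-z\s]", " ", text)          # remove everything but letters
    return " ".join(text.split())                  # collapse extra spaces

print(clean_text("WIN $1000 now!!! Visit http://example.com"))
# win number now visit url
```

Replacing links and amounts with placeholders lets the model learn that "contains a link" matters, without treating every different URL as a new word.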

⚠️

Beginners often forget to convert text into numbers before training the model.
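One way to avoid this mistake is to bundle the vectorizer and the model with scikit-learn's make_pipeline, so raw text goes in and labels come out. A minimal sketch (the variable name spam_filter is made up for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# The pipeline converts text to counts, then passes the counts to the model
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)

# New messages are passed as plain strings
print(spam_filter.predict(['Free prize money']).tolist())  # ['spam']
```

The pipeline applies the same text-to-numbers step during training and prediction, so it cannot be skipped by accident.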
