
NLP Program to Classify Spam Messages

Use Python with scikit-learn's CountVectorizer to convert text into word counts and MultinomialNB to train a spam classifier: fit it with model.fit(X_train, y_train), then predict with model.predict(X_test).
📋

Examples

Input: Free money now!!!
Output: spam
Input: Hi, are we meeting today?
Output: not spam
Input: Congratulations, you won a prize!
Output: spam
🧠

How to Think About It

To classify spam, first convert text messages into numbers that a computer can understand using a method like counting words. Then, train a simple model that learns patterns from labeled examples of spam and not spam messages. Finally, use this model to predict if new messages are spam or not.
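As a rough illustration of the "counting words" idea, here is a minimal sketch using only the standard library. The helper name `word_counts` and the letters-only tokenization are assumptions made for this example; scikit-learn's CountVectorizer uses its own tokenization rules.

```python
import re
from collections import Counter

def word_counts(message):
    """Lowercase the message, split it into runs of letters, and count each word."""
    words = re.findall(r"[a-z]+", message.lower())
    return Counter(words)

counts = word_counts("Free money now!!!")
print(dict(counts))  # {'free': 1, 'money': 1, 'now': 1}
```

Each message becomes a small dictionary of counts; a vectorizer does the same thing but lines the counts up into fixed columns so every message has the same shape.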
📐

Algorithm

1. Collect labeled messages as spam or not spam.
2. Convert messages into numerical features using word counts.
3. Split data into training and testing sets.
4. Train a Naive Bayes classifier on the training data.
5. Use the trained model to predict labels on test messages.
6. Evaluate the model's accuracy.
💻

Code

python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Tiny labeled dataset: one label per message
texts = ['Free money now!!!', 'Hi, are we meeting today?', 'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# Turn each message into a row of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out 40% of the messages (2 of 5) for testing
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train Naive Bayes on the training rows, then predict the held-out rows
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# predict returns a NumPy array; tolist() prints it as a plain list
print('Predictions:', predictions.tolist())
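Output (illustrative; with only five messages, the labels printed depend on which two messages the split holds out)
Predictions: ['spam', 'not spam']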
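Step 6 of the algorithm (evaluating accuracy) is not part of the code above. A minimal sketch of that step follows, using scikit-learn's `accuracy_score`. The two label lists are hypothetical stand-ins for `y_test` and `predictions`, chosen only so the arithmetic is easy to follow.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_test and predictions
y_true = ['spam', 'not spam', 'spam', 'not spam']
y_pred = ['spam', 'not spam', 'not spam', 'not spam']

# accuracy_score returns the fraction of predictions that match the true labels
accuracy = accuracy_score(y_true, y_pred)
print('Accuracy:', accuracy)  # 3 of 4 correct -> 0.75
```

With a two-message test set, accuracy can only be 0, 0.5, or 1, so a figure measured on this toy dataset says little about real-world performance.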
🔍

Dry Run

Let's trace the example 'Free money now!!!' and 'Hi, are we meeting today?' through the code.

1. Convert texts to numbers

Texts become a matrix with one row per message and one column per vocabulary word; each cell holds a word count, e.g., 'free':1, 'money':1, 'now':1 for 'Free money now!!!'.

2. Split data

Training set gets 3 messages, test set gets 2 messages.

3. Train model

Model learns word patterns linked to spam or not spam from training data.

4. Predict test labels

Model predicts a label for each of the two held-out messages, e.g., 'spam' for 'Congratulations, you won a prize!' and 'not spam' for 'Hi, are we meeting today?' if those are the two in the test set.

Message | Word Counts | Label
Free money now!!! | {free:1, money:1, now:1} | spam
Hi, are we meeting today? | {hi:1, are:1, we:1, meeting:1, today:1} | not spam
Congratulations, you won a prize! | {congratulations:1, you:1, won:1, prize:1} | spam
Call me now | {call:1, me:1, now:1} | not spam
Hello friend | {hello:1, friend:1} | not spam

Note: CountVectorizer's default settings ignore one-letter tokens, so 'a' gets no column.
💡

Why This Works

Step 1: Text to numbers

The CountVectorizer changes words into numbers so the model can understand text.

Step 2: Training the model

The MultinomialNB learns which words are common in spam or not spam messages.

Step 3: Making predictions

The model uses learned word patterns to guess if new messages are spam or not.
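To make "learned word patterns" concrete, here is a hand-rolled sketch of the idea behind multinomial Naive Bayes with add-one smoothing. It is a simplified illustration, not scikit-learn's implementation; the toy training data and the helper names `log_score` and `classify` are assumptions for this example.

```python
import math
from collections import Counter

# Hypothetical toy training data, already split into lowercase words
training = [
    (['free', 'money', 'now'], 'spam'),
    (['you', 'won', 'prize'], 'spam'),
    (['call', 'me', 'now'], 'not spam'),
    (['hello', 'friend'], 'not spam'),
]

# Count how often each word appears in each class, and how many messages per class
word_counts = {'spam': Counter(), 'not spam': Counter()}
class_counts = Counter()
for words, label in training:
    word_counts[label].update(words)
    class_counts[label] += 1

vocabulary = {w for words, _ in training for w in words}

def log_score(words, label, alpha=1.0):
    """Log prior plus summed log likelihoods, with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for w in words:
        score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocabulary)))
    return score

def classify(words):
    """Pick the class with the higher score."""
    return max(word_counts, key=lambda label: log_score(words, label))

print(classify(['free', 'prize']))    # spam
print(classify(['hello', 'friend']))  # not spam
```

Smoothing matters here: without it, a single word never seen in a class would give that class a probability of zero.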

🔄

Alternative Approaches

Use TF-IDF Vectorizer with Logistic Regression
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ['Free money now!!!', 'Hi, are we meeting today?', 'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print('Predictions:', predictions.tolist())
TF-IDF down-weights words that appear in many messages, so distinctive words carry more weight; Logistic Regression can be more accurate but is typically slower to train than Naive Bayes.
Use a simple rule-based keyword filter
python
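def classify_spam(text):
    # Keywords that mark a message as spam
    spam_words = ['free', 'win', 'prize', 'money']
    text_lower = text.lower()
    for word in spam_words:
        # Substring check: 'win' also matches inside longer words such as 'winter'
        if word in text_lower:
            return 'spam'
    return 'not spam'

print(classify_spam('Free money now!!!'))  # spam
print(classify_spam('Hello friend'))       # not spam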
Very fast and simple but less accurate and cannot learn from data.
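One source of the lower accuracy is substring matching: the check `'win' in text` is also true for a message containing "winter". A possible variant that matches whole words only is sketched below; the function name `classify_spam_words` and the letters-only tokenization are assumptions for this example.

```python
import re

SPAM_WORDS = {'free', 'win', 'prize', 'money'}

def classify_spam_words(text):
    """Flag a message only when a whole word matches a spam keyword."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return 'spam' if words & SPAM_WORDS else 'not spam'

print(classify_spam_words('Free money now!!!'))    # spam
print(classify_spam_words('See you this winter'))  # not spam
```

The trade-off is that whole-word matching misses inflected forms such as "winning", so the keyword list may need to grow.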

Complexity: O(n*m) time, O(n*m) space

Time Complexity

Converting n messages into count vectors over a vocabulary of m unique words takes O(n*m) in the worst case. Training Naive Bayes is also O(n*m), since it only sums word counts per class.

Space Complexity

Storing the word count matrix requires O(n*m) space, where n is the number of messages and m is the vocabulary size.
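The O(n*m) figure describes a dense matrix. CountVectorizer returns a SciPy sparse matrix, which stores only the non-zero counts, so memory in practice grows with the number of words actually present. The sketch below compares the two numbers for the five example messages.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']

X = CountVectorizer().fit_transform(texts)

n_messages, vocab_size = X.shape
print('Matrix shape:', X.shape)                  # (n messages, m vocabulary words)
print('Dense cells:', n_messages * vocab_size)   # n * m
print('Stored non-zero counts:', X.nnz)          # entries the sparse matrix keeps
```

Short messages use only a few vocabulary words each, so the gap between dense cells and stored entries widens as the vocabulary grows.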

Which Approach is Fastest?

Rule-based filtering is fastest but least accurate; Naive Bayes balances speed and accuracy; Logistic Regression is slower but can improve accuracy.

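Approach | Time | Space | Best For
Naive Bayes with CountVectorizer | O(n*m) | O(n*m) | Balanced speed and accuracy
Logistic Regression with TF-IDF | O(n*m) per training iteration | O(n*m) | Higher accuracy, slower training
Rule-based keyword filter | O(n*k) | O(k) | Very fast, simple, low accuracy

Here n is the number of messages, m is the vocabulary size, and k is the number of keywords.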
💡
Clean and preprocess text consistently (for example, lowercasing and removing punctuation) before training; this generally improves spam detection.
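What "cleaning" means varies by dataset. One possible sketch is shown below; the function name `clean_text` and the choice to replace links and digits with the placeholder tokens `url` and `number` are assumptions for this example, not a fixed recipe.

```python
import re

def clean_text(message):
    """Lowercase, replace URLs and digits with placeholder tokens, drop punctuation."""
    text = message.lower()
    text = re.sub(r"https?://\S+", " url ", text)  # collapse links into one token
    text = re.sub(r"\d+", " number ", text)        # collapse digit runs into one token
    text = re.sub(r"[^a-z\s]", " ", text)          # remove punctuation and symbols
    return " ".join(text.split())                  # normalize whitespace

print(clean_text("WIN $1000 now!!! Visit http://example.com"))
# win number now visit url
```

Placeholder tokens let the model learn that "contains a link" or "contains a number" is a signal, without memorizing every specific URL or amount.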
⚠️
Beginners often forget to convert text into numbers before training the model; MultinomialNB cannot be fit on raw strings.
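One way to make this mistake harder to commit is to bundle the vectorizer and the classifier with scikit-learn's `make_pipeline`, so raw strings go in and labels come out. The sketch below trains on all five example messages; the variable name `spam_pipeline` and the two new messages are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['Free money now!!!', 'Hi, are we meeting today?',
         'Congratulations, you won a prize!', 'Call me now', 'Hello friend']
labels = ['spam', 'not spam', 'spam', 'not spam', 'not spam']

# The pipeline vectorizes the text first, then passes the counts to the classifier
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(texts, labels)

# Hypothetical new messages; the same vectorizer is applied automatically
new_messages = ['Free prize money', 'Are we meeting today?']
print(spam_pipeline.predict(new_messages).tolist())  # ['spam', 'not spam']
```

A pipeline also guarantees that new messages are transformed with the vocabulary learned during training instead of a freshly fitted one.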