
How to Use Naive Bayes for Text Classification in NLP

Use Naive Bayes for text classification by converting text into numeric features (like word counts) and training a MultinomialNB model on labeled data. This model predicts the category of new text based on learned word probabilities.
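To make "learned word probabilities" concrete, the sketch below fits a model on a tiny made-up corpus and recomputes P(word | spam) by hand. The corpus and the variable names are assumptions for illustration only. The smoothing formula, (count + alpha) / (total + alpha * vocabulary size), is the add-one (Laplace) smoothing that `MultinomialNB` applies with its default `alpha=1.0`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, used only for illustration
texts = ['free money', 'free prize', 'meeting today']
labels = ['spam', 'spam', 'ham']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB(alpha=1.0)  # alpha=1.0 is add-one (Laplace) smoothing
model.fit(X, labels)

# Recompute P(word | spam) by hand from the spam rows of the count matrix
spam_rows = X.toarray()[np.array(labels) == 'spam']
word_counts = spam_rows.sum(axis=0)
manual_probs = (word_counts + 1.0) / (word_counts.sum() + 1.0 * X.shape[1])

# The model stores the same quantities as log-probabilities, one row per class
spam_index = list(model.classes_).index('spam')
learned_probs = np.exp(model.feature_log_prob_[spam_index])

# List the vocabulary in column order
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for word, p in zip(words, learned_probs):
    print(f'P({word} | spam) = {p:.3f}')
```

Words that appear often in spam messages get a higher probability under the spam class, and smoothing keeps words never seen in a class from receiving a probability of zero.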

Syntax

To use Naive Bayes for text classification, follow these steps:

  • Text Vectorization: Convert text into numbers using CountVectorizer or TfidfVectorizer.
  • Model Creation: Use MultinomialNB() from sklearn.naive_bayes to create the model.
  • Fit Model: Train the model with vectorized text and labels using fit().
  • Predict: Use predict() on new vectorized text to get categories.
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample training data (placeholders; replace with your own texts and labels)
train_texts = ['sample spam text', 'sample ham text']
train_labels = ['spam', 'ham']

# Step 1: Convert text to numbers
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: Create Naive Bayes model
model = MultinomialNB()

# Step 3: Train model
model.fit(X_train, train_labels)

# Step 4: Predict on new data
test_texts = ['new sample text']  # Placeholder for new text samples
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
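
The four steps can also be bundled into a single object. The sketch below uses scikit-learn's `make_pipeline` as one possible way to do this. The training texts and the name `text_clf` are placeholders, not part of the original steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical placeholder data, for illustration only
train_texts = ['free prize now', 'see you at lunch']
train_labels = ['spam', 'ham']

# The pipeline vectorizes and classifies in one object, so the vocabulary
# fitted on the training texts is reused automatically at prediction time
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(train_texts, train_labels)

# Raw strings go in; the pipeline calls transform() internally
predictions = text_clf.predict(['claim your free prize'])
print(predictions)
```

A pipeline is mainly a convenience: it removes the chance of transforming new text with a different vectorizer than the one used in training.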

Example

This example shows how to classify simple text messages into two categories: 'spam' or 'ham' (not spam) using Naive Bayes.

python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample training data
train_texts = [
    'Free money now!!!',
    'Hi, how are you?',
    'Win a free lottery ticket',
    'Hello friend, long time no see',
    'Claim your free prize',
    'Are we meeting today?'
]
train_labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

# Sample test data
test_texts = [
    'Free prize for you',
    'How about a meeting tomorrow?'
]
true_labels = ['spam', 'ham']

# Vectorize text
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train Naive Bayes model
model = MultinomialNB()
model.fit(X_train, train_labels)

# Predict
predictions = model.predict(X_test)

# Show predictions and accuracy
print('Predictions:', predictions)
print('Accuracy:', accuracy_score(true_labels, predictions))
Output
Predictions: ['spam' 'ham']
Accuracy: 1.0
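
Beyond hard labels, the fitted model can report how confident it is in each class. The sketch below repeats the training data from the example and calls `predict_proba()`. The printing loop is only one possible way to display the result.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same training and test messages as in the example above
train_texts = [
    'Free money now!!!',
    'Hi, how are you?',
    'Win a free lottery ticket',
    'Hello friend, long time no see',
    'Claim your free prize',
    'Are we meeting today?'
]
train_labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']
test_texts = ['Free prize for you', 'How about a meeting tomorrow?']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Each row holds one probability per class, in the order of model.classes_
probabilities = model.predict_proba(X_test)
for text, row in zip(test_texts, probabilities):
    scores = ', '.join(f'{label}={p:.2f}' for label, p in zip(model.classes_, row))
    print(f'{text!r}: {scores}')
```

Naive Bayes probabilities tend to be overconfident because the model treats words as independent, so they are better read as a ranking signal than as calibrated estimates.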

Common Pitfalls

Common mistakes when using Naive Bayes for text classification include:

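  • Fitting the model on raw strings instead of the numeric features produced by a vectorizer.
  • Calling fit_transform() on the test data, or fitting a second vectorizer, instead of calling transform() with the vectorizer fitted on the training data.
  • Using the wrong Naive Bayes variant: MultinomialNB suits word counts, BernoulliNB suits binary word-presence features, and GaussianNB assumes continuous, normally distributed features.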
python
from sklearn.naive_bayes import MultinomialNB

# Wrong: Trying to fit raw text directly
texts = ['hello world', 'free money']
labels = ['ham', 'spam']
model = MultinomialNB()

# This will raise an error because texts are strings, not numbers
# model.fit(texts, labels)  # WRONG

# Right way:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model.fit(X, labels)  # Correct
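
The test-data pitfall listed above can be reproduced in a few lines. This is a minimal sketch on made-up data: fitting a second vectorizer on the test text builds a different vocabulary, so the feature columns no longer match what the model was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only
train_texts = ['free money now', 'hello old friend', 'win a free prize']
train_labels = ['spam', 'ham', 'spam']
test_texts = ['free prize']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB().fit(X_train, train_labels)

# Wrong: a freshly fitted vectorizer learns its own, smaller vocabulary
X_wrong = CountVectorizer().fit_transform(test_texts)
print('Train features:', X_train.shape[1], '| wrong test features:', X_wrong.shape[1])

try:
    model.predict(X_wrong)
except ValueError as err:
    print('Mismatch:', err)

# Right: reuse the fitted vectorizer and only call transform()
X_test = vectorizer.transform(test_texts)
print(model.predict(X_test))
```

Even when the feature counts happen to match, the columns would refer to different words, so the predictions would be meaningless rather than raising an error.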

Quick Reference

Summary tips for using Naive Bayes in NLP:

  • Always convert text to numeric features using CountVectorizer or TfidfVectorizer.
  • Use MultinomialNB for text classification tasks.
  • Fit the model on training data and transform test data with the same vectorizer.
  • Check model accuracy with accuracy_score or similar metrics.
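
As a sketch of the tips above, the example below swaps in `TfidfVectorizer` and reports per-class metrics instead of a single accuracy number. The data is made up for illustration, and `classification_report` and `confusion_matrix` are just two of the "similar metrics" available in `sklearn.metrics`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only
train_texts = [
    'free money now', 'claim your free prize', 'win a lottery ticket',
    'are we meeting today', 'hello friend how are you', 'see you at lunch',
]
train_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']
test_texts = ['free lottery prize', 'meeting at lunch today']
true_labels = ['spam', 'ham']

# TF-IDF weights are non-negative, so MultinomialNB accepts them
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # same vectorizer, transform only

model = MultinomialNB().fit(X_train, train_labels)
predictions = model.predict(X_test)

# Rows are true classes and columns are predicted classes, in this label order
class_order = ['ham', 'spam']
cm = confusion_matrix(true_labels, predictions, labels=class_order)
print(cm)
print(classification_report(true_labels, predictions, labels=class_order, zero_division=0))
```

On a dataset this small the numbers carry no statistical weight. The point is only the shape of the workflow.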

Key Takeaways

  • Convert text data into numeric features before training Naive Bayes.
  • Use MultinomialNB for text classification with word count features.
  • Always transform test data with the same vectorizer used on training data.
  • Check predictions with accuracy or other classification metrics.
  • Avoid fitting raw text directly to the model without vectorization.