How to Use Naive Bayes for Text Classification in NLP
Use Naive Bayes for text classification by converting text into numeric features (like word counts) and training a MultinomialNB model on labeled data. This model predicts the category of new text based on learned word probabilities.
Syntax
To use Naive Bayes for text classification, follow these steps:
- Text Vectorization: Convert text into numbers using CountVectorizer or TfidfVectorizer.
- Model Training: Use MultinomialNB() from sklearn.naive_bayes to create the model.
- Fit Model: Train the model with vectorized text and labels using fit().
- Predict: Use predict() on new vectorized text to get categories.
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample training data
train_texts = ['sample text data']  # Placeholder for actual training texts
train_labels = ['label']  # Placeholder for actual labels

# Step 1: Convert text to numbers
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: Create Naive Bayes model
model = MultinomialNB()

# Step 3: Train model
model.fit(X_train, train_labels)

# Step 4: Predict on new data
test_texts = ['new text sample']  # Placeholder for actual new texts
X_test = vectorizer.transform(test_texts)  # transform only; do not refit
predictions = model.predict(X_test)
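The steps above name TfidfVectorizer as an alternative to CountVectorizer. The following is a minimal sketch of that swap, assuming a small toy dataset invented here for illustration; it is not part of the original example. TF-IDF values are non-negative floats, which MultinomialNB accepts in practice even though the model is formally defined for counts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, invented for illustration only
train_texts = ['cheap pills online', 'lunch at noon today',
               'cheap loans online', 'see you at lunch']
train_labels = ['spam', 'ham', 'spam', 'ham']

# TF-IDF weights replace raw counts; all values are non-negative
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Reuse the fitted vectorizer on new text
X_new = vectorizer.transform(['cheap pills today'])
prediction = model.predict(X_new)
print(prediction)
```

Whether TF-IDF or raw counts works better depends on the dataset, so it is worth comparing both on held-out data.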
Example
This example shows how to classify simple text messages into two categories: 'spam' or 'ham' (not spam) using Naive Bayes.
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample training data
train_texts = [
    'Free money now!!!',
    'Hi, how are you?',
    'Win a free lottery ticket',
    'Hello friend, long time no see',
    'Claim your free prize',
    'Are we meeting today?'
]
train_labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

# Sample test data
test_texts = [
    'Free prize for you',
    'How about a meeting tomorrow?'
]
true_labels = ['spam', 'ham']

# Vectorize text
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train Naive Bayes model
model = MultinomialNB()
model.fit(X_train, train_labels)

# Predict
predictions = model.predict(X_test)

# Show predictions and accuracy
print('Predictions:', predictions)
print('Accuracy:', accuracy_score(true_labels, predictions))
Output
Predictions: ['spam' 'ham']
Accuracy: 1.0
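To see the learned word probabilities behind a prediction, the fitted model can be inspected. The sketch below reuses the spam/ham training messages from the example and relies on the scikit-learn attributes classes_, feature_log_prob_, and vocabulary_ and the method predict_proba(), none of which appear in the original example. It is meant as an illustration rather than a required step.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    'Free money now!!!', 'Hi, how are you?', 'Win a free lottery ticket',
    'Hello friend, long time no see', 'Claim your free prize',
    'Are we meeting today?'
]
train_labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB().fit(X_train, train_labels)

# Class order used by predict_proba and feature_log_prob_
print(model.classes_)

# Posterior probability of each class for one new message
X_new = vectorizer.transform(['Free prize for you'])
proba = model.predict_proba(X_new)[0]
for label, p in zip(model.classes_, proba):
    print(f'{label}: {p:.3f}')

# Smoothed log P(word | class) for the word 'free'
free_idx = vectorizer.vocabulary_['free']
for label, row in zip(model.classes_, model.feature_log_prob_):
    print(f"log P('free' | {label}) = {row[free_idx]:.3f}")
```

The word 'free' occurs in all three spam messages and in no ham message, so its log probability is higher under the spam class, which is what pushes the prediction toward spam.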
Common Pitfalls
Common mistakes when using Naive Bayes for text classification include:
- Passing raw text to the model without first vectorizing it into numeric features.
- Ignoring the need to transform test data with the same vectorizer used on training data (call transform(), not fit_transform(), on test data).
- Using the wrong Naive Bayes variant; MultinomialNB is the variant suited to word counts.
python
from sklearn.naive_bayes import MultinomialNB

# Wrong: Trying to fit raw text directly
texts = ['hello world', 'free money']
labels = ['ham', 'spam']

model = MultinomialNB()
# This will raise an error because texts are strings, not numbers
# model.fit(texts, labels)  # WRONG

# Right way:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model.fit(X, labels)  # Correct
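One way to make the vectorizer pitfalls harder to hit is to bundle the vectorizer and the classifier into a single scikit-learn pipeline. The sketch below uses make_pipeline from sklearn.pipeline, which is not used in the original article; the test messages are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['hello world', 'free money']
labels = ['ham', 'spam']

# The pipeline fits the vectorizer and the classifier together,
# then reuses the fitted vocabulary automatically at predict time
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# Raw strings go in directly; there is no separate transform() call to forget
predictions = clf.predict(['free money today', 'hello again'])
print(predictions)
```

Because the pipeline owns the vectorizer, test data cannot accidentally be transformed with a different vocabulary than the one learned during training.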
Quick Reference
Summary tips for using Naive Bayes in NLP:
- Always convert text to numeric features using CountVectorizer or TfidfVectorizer.
- Use MultinomialNB for text classification tasks.
- Fit the model on training data and transform test data with the same vectorizer.
- Check model accuracy with accuracy_score or similar metrics.
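As a sketch of what "similar metrics" might look like, the snippet below compares accuracy_score with confusion_matrix and classification_report from sklearn.metrics. The label lists are hypothetical values invented to show the calls, not results from the example above.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels, invented to illustrate the metric calls
true_labels = ['spam', 'ham', 'spam', 'ham', 'ham', 'spam']
predictions = ['spam', 'ham', 'ham', 'ham', 'ham', 'spam']

accuracy = accuracy_score(true_labels, predictions)
print('Accuracy:', accuracy)

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`
cm = confusion_matrix(true_labels, predictions, labels=['ham', 'spam'])
print(cm)

# Per-class precision, recall, and F1
print(classification_report(true_labels, predictions))
```

Per-class precision and recall are often more informative than accuracy alone when one class, such as spam, is much rarer than the other.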
Key Takeaways
Convert text data into numeric features before training Naive Bayes.
Use MultinomialNB for text classification with word count features.
Always transform test data with the same vectorizer used on training data.
Check predictions with accuracy or other classification metrics.
Avoid fitting raw text directly to the model without vectorization.
