How to Use Logistic Regression for Text Classification in NLP
To use logistic regression for text in NLP, first convert the text into numerical features using a method like TF-IDF vectorization. Then train a logistic regression model on these features to classify or predict text categories.
Syntax
Using logistic regression for text involves these steps:
- Text vectorization: Convert text into numbers using TfidfVectorizer.
- Model training: Use LogisticRegression to fit the vectorized data and labels.
- Prediction: Use the trained model to predict labels for new text.
python
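from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder inputs: replace with your own documents and labels
text_data = ["Free money now", "See you at lunch"]
labels = [1, 0]
new_texts = ["Free lunch"]

# Step 1: Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)

# Step 2: Train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Step 3: Predict (reuse the fitted vectorizer on the new text)
X_new = vectorizer.transform(new_texts)
predictions = model.predict(X_new)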
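The three steps can also be chained into a single estimator, so the vectorizer and the model are always fitted and applied together. The following is a minimal sketch using scikit-learn's make_pipeline; the toy messages and the names clf and new_texts are illustrative assumptions, not part of the original example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
text_data = [
    "Free money now",
    "Win a free prize",
    "See you at lunch",
    "Meeting moved to noon",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Chain vectorization and classification into one estimator
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(text_data, labels)

# The pipeline vectorizes raw strings internally before predicting
new_texts = ["Free prize inside", "Lunch at noon?"]
predictions = clf.predict(new_texts)
print(predictions)
```

One benefit of this design is that raw strings go in at both fit and predict time, so there is no chance of forgetting to transform new text with the fitted vectorizer.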
Example
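This example shows how to classify simple text messages as spam or not spam using logistic regression. With only six messages, the accuracy score illustrates the workflow and is not a meaningful estimate of real-world performance.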
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample text data and labels
texts = [
    "Free money now!!!",
    "Hi, how are you?",
    "Win a free ticket",
    "Let's meet tomorrow",
    "Congratulations, you won!",
    "Are you coming to the party?"
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# Vectorize text: fit on the training set only, then transform the test set
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Predict on test data
predictions = model.predict(X_test_vec)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
Output
Accuracy: 1.00
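Because logistic regression is a probabilistic model, it can also report how confident it is in each prediction. The sketch below reuses the six sample messages, trains on all of them, and calls predict_proba on two new messages; the new messages are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "Free money now!!!",
    "Hi, how are you?",
    "Win a free ticket",
    "Let's meet tomorrow",
    "Congratulations, you won!",
    "Are you coming to the party?",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Hypothetical unseen messages
new_messages = ["Win free money", "See you tomorrow"]
X_new = vectorizer.transform(new_messages)

# Columns follow model.classes_, here [0, 1], so column 1 is P(spam)
probabilities = model.predict_proba(X_new)
for message, (p_ham, p_spam) in zip(new_messages, probabilities):
    print(f"{message!r}: P(spam) = {p_spam:.2f}")
```

On a dataset this small the probabilities stay close to 0.5, so treat them as a demonstration of the API and not as calibrated scores.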
Common Pitfalls
Common mistakes when using logistic regression for text include:
- Not converting text to numerical features before training.
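- Feeding in uncleaned text without checking what the vectorizer already does. TfidfVectorizer lowercases text and drops punctuation by default, but other cleaning (for example, removing stop words) must be requested or done beforehand.
- Skipping the train-test split, which hides overfitting because the model is only scored on data it has already seen.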
- Not tuning model parameters or vectorizer settings.
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Wrong: Trying to fit raw text directly
texts = ["Hello world", "Free money"]
labels = [0, 1]

model = LogisticRegression()
try:
    # Raises a ValueError; the exact message depends on the scikit-learn version
    model.fit(texts, labels)
except Exception as e:
    print(f"Error: {e}")

# Right: Vectorize text first
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model.fit(X, labels)
print("Model trained successfully after vectorization.")
Output
Error: Expected 2D array, got 1D array instead: array=['Hello world' 'Free money'].
Model trained successfully after vectorization.
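The last pitfall, untuned settings, can be addressed with a small grid search over both the vectorizer and the model. This is a rough sketch using GridSearchCV on a Pipeline; the step names tfidf and clf, the parameter grid, and the two extra messages at the end of the list are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "Free money now!!!",
    "Hi, how are you?",
    "Win a free ticket",
    "Let's meet tomorrow",
    "Congratulations, you won!",
    "Are you coming to the party?",
    "Claim your free prize today",   # hypothetical extra spam message
    "Lunch at noon tomorrow?",       # hypothetical extra non-spam message
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy dataset is tiny; larger data allows more folds
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

Putting the vectorizer inside the pipeline means it is refitted on each training fold, so the validation folds never leak into the vocabulary.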
Quick Reference
Tips for using logistic regression with text data:
- Always convert text to numeric vectors using TfidfVectorizer or CountVectorizer.
- Split data into training and testing sets to evaluate performance.
- Use LogisticRegression from sklearn.linear_model for classification.
- Tune vectorizer parameters like max_features and ngram_range for better results.
- Check model accuracy with metrics like accuracy_score.
Key Takeaways
Convert text to numeric vectors using TF-IDF before applying logistic regression.
Always split data into training and testing sets so that overfitting shows up in the evaluation.
Logistic regression works well for binary text classification tasks.
Preprocessing text can improve model performance; note that TfidfVectorizer already lowercases text by default.
Evaluate your model with accuracy or other classification metrics.
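Accuracy alone can be misleading when one class is much rarer than the other, which is common in spam detection. As a small illustration of the other classification metrics, the sketch below computes precision, recall, F1, and a confusion matrix on hand-written labels; the y_true and y_pred values are invented for the example and do not come from a trained model.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions (1 = spam, 0 = not spam)
y_true = [1, 0, 1, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted spam, how much is spam
recall = recall_score(y_true, y_pred)        # of actual spam, how much was caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Rows are true classes, columns are predicted classes, ordered [0, 1]
matrix = confusion_matrix(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 score:  {f1:.2f}")
print(matrix)
```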
