How to Do Sentiment Analysis with sklearn in Python
To do sentiment analysis with sklearn in Python, convert text data into numbers using TfidfVectorizer, then train a classifier like LogisticRegression on labeled sentiment data. Finally, use the trained model to predict sentiment labels for new text.
Syntax
Sentiment analysis with sklearn involves these steps:
- TfidfVectorizer(): Converts text into numerical features.
- LogisticRegression(): A simple classifier to learn sentiment.
- fit(X_train, y_train): Train the model on vectorized text and labels.
- predict(X_test): Predict sentiment on new data.
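```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# train_texts, train_labels and test_texts are placeholders for your own data
vectorizer = TfidfVectorizer()
model = LogisticRegression()

# Learn the vocabulary from the training text and vectorize it
X_train = vectorizer.fit_transform(train_texts)
y_train = train_labels
model.fit(X_train, y_train)

# Reuse the fitted vectorizer on new text, then predict
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
```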
Example
This example shows how to train a sentiment analysis model on sample sentences labeled as positive or negative, then predict sentiment on new sentences.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sample training data
train_texts = [
    "I love this product",
    "This is the worst thing ever",
    "Absolutely fantastic experience",
    "I hate it",
    "Not good, very bad",
    "I am so happy with this"
]
train_labels = ["positive", "negative", "positive", "negative", "negative", "positive"]

# Vectorize text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Train model
model = LogisticRegression()
model.fit(X_train, train_labels)

# New texts to predict
test_texts = ["I really love this", "This is bad"]
X_test = vectorizer.transform(test_texts)

# Predict sentiment
predictions = model.predict(X_test)
print(predictions)
```
Output
['positive' 'negative']
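Beyond hard labels, it can be useful to see how confident the model is. The sketch below reuses the same sample data and calls `predict_proba`, whose columns follow the order of `model.classes_`. With only six training sentences the probabilities are rough, so treat this as an illustration rather than a meaningful score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "I love this product",
    "This is the worst thing ever",
    "Absolutely fantastic experience",
    "I hate it",
    "Not good, very bad",
    "I am so happy with this",
]
train_labels = ["positive", "negative", "positive", "negative", "negative", "positive"]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

test_texts = ["I really love this", "This is bad"]
probabilities = model.predict_proba(vectorizer.transform(test_texts))

# Pair each probability with its class label using model.classes_
for text, row in zip(test_texts, probabilities):
    scores = {str(label): round(float(p), 3) for label, p in zip(model.classes_, row)}
    print(text, "->", scores)
```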
Common Pitfalls
Common mistakes include:
- Not fitting the vectorizer on training data before transforming test data, causing errors or poor results.
- Using raw text directly without vectorization, which sklearn models cannot handle.
- Mixing label formats, such as strings for some samples and integers for others. sklearn classifiers accept string or numeric labels directly, but the format must be consistent across the whole dataset.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Small sample data so the snippet runs on its own
train_texts = ["I love this product", "I hate it"]
train_labels = ["positive", "negative"]
test_texts = ["I really love this"]

# Wrong: transform test data before fitting the vectorizer
vectorizer = TfidfVectorizer()
# vectorizer.transform(test_texts)  # This will fail because vectorizer is not fitted

# Right way: fit on training data, then transform both sets
vectorizer.fit(train_texts)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = LogisticRegression()
model.fit(X_train, train_labels)
predictions = model.predict(X_test)
```
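One way to sidestep the first two pitfalls is to bundle the vectorizer and classifier together. The sketch below uses `make_pipeline` from `sklearn.pipeline`, which is not part of the example above but is a standard sklearn helper. The variable name `sentiment_model` and the four training sentences are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative sample data
train_texts = [
    "I love this product",
    "I hate it",
    "Absolutely fantastic experience",
    "Not good, very bad",
]
train_labels = ["positive", "negative", "positive", "negative"]

# The pipeline fits the vectorizer and the classifier in the right order
sentiment_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_model.fit(train_texts, train_labels)

# Raw text goes in directly; vectorization happens inside the pipeline
predictions = sentiment_model.predict(["I really love this", "This is bad"])
print(predictions)
```

Because the pipeline only ever calls `transform` on new text, there is no way to accidentally refit the vectorizer on test data.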
Quick Reference
Tips for sklearn sentiment analysis:
- Always fit TfidfVectorizer on training data only.
- Use simple classifiers like LogisticRegression for baseline models.
- Preprocess text if needed (lowercase, remove punctuation) before vectorizing.
- Evaluate model with accuracy or other metrics on a test set.
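The evaluation tip could look something like the sketch below, which holds out part of the labeled data with `train_test_split` and scores it with `accuracy_score`. The eight sentences are hypothetical, and a split this small gives no reliable estimate; a real project would need far more labeled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical labeled data, far too small for a real evaluation
texts = [
    "I love this product", "This is the worst thing ever",
    "Absolutely fantastic experience", "I hate it",
    "Not good, very bad", "I am so happy with this",
    "What a great purchase", "Terrible and disappointing",
]
labels = ["positive", "negative", "positive", "negative",
          "negative", "positive", "positive", "negative"]

# Hold out 25% of the data, keeping both classes in each split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on the training split only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = LogisticRegression()
model.fit(X_train, train_labels)

accuracy = accuracy_score(test_labels, model.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")
```

Note that TfidfVectorizer lowercases text and drops punctuation by default, so extra preprocessing is only needed for cases those defaults do not cover.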
Key Takeaways
Convert text to numbers using TfidfVectorizer before training sklearn models.
Train classifiers like LogisticRegression on labeled sentiment data.
Always fit vectorizer on training data, then transform test data.
Predict sentiment by applying the trained model to new vectorized text.
Check predictions and evaluate model accuracy on test data.