Python sklearn Program to Analyze Sentiment with ML
Use CountVectorizer to convert text to numbers and LogisticRegression from sklearn to train a sentiment model; for example, model.fit(vectorizer.fit_transform(texts), labels) trains the model and model.predict(vectorizer.transform(new_texts)) predicts sentiment.
How to Think About It
First, convert the texts into numeric features with CountVectorizer. Then, train a simple model like LogisticRegression on labeled examples of positive and negative texts. Finally, use the trained model to predict the sentiment of new texts.
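If you prefer to keep these steps bundled together, sklearn's Pipeline can chain the vectorizer and classifier so a single fit() call runs both; here is a minimal sketch (the step names 'vectorizer' and 'classifier' are arbitrary labels, and the two training sentences are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Chain vectorizing and training: fit() runs both steps in order.
sentiment_model = Pipeline([
    ('vectorizer', CountVectorizer()),                  # text -> word counts
    ('classifier', LogisticRegression(max_iter=1000)),  # counts -> sentiment
])
sentiment_model.fit(['I love this', 'This is bad'], [1, 0])
print(sentiment_model.predict(['love it or not']))  # raw text goes straight in
```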
Code
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Labeled training data: 1 = positive, 0 = negative
texts = ['I love this', 'This is bad', 'Amazing experience', 'Worst ever']
labels = [1, 0, 1, 0]

# Step 1: turn the texts into word-count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Step 2: train a logistic regression classifier
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Step 3: vectorize new texts with the same vocabulary and predict
new_texts = ['I hate this', 'What a great day']
X_new = vectorizer.transform(new_texts)
predictions = model.predict(X_new)

for text, pred in zip(new_texts, predictions):
    sentiment = 'Positive' if pred == 1 else 'Negative'
    print(f'Text: "{text}" -> Sentiment: {sentiment}')
```
Dry Run
Let's trace the example texts ['I hate this', 'What a great day'] through the code:
Vectorize new texts
Convert ['I hate this', 'What a great day'] into numeric features using the learned vocabulary; words not seen during training, such as 'hate', 'what', 'great', and 'day', are simply ignored
Predict sentiment
The model predicts [0, 1], meaning Negative for the first text and Positive for the second
Print results
Output 'Negative' for 'I hate this' and 'Positive' for 'What a great day'
| Text | Vectorized Features | Prediction | Sentiment |
|---|---|---|---|
| I hate this | [0 0 0 0 0 0 1 0] | 0 | Negative |
| What a great day | [0 0 0 0 0 0 0 0] | 1 | Positive |
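You can verify the vocabulary and the vectors behind this table yourself. CountVectorizer's default tokenizer lowercases text and drops single-character tokens such as 'I' and 'a', leaving an eight-word vocabulary in alphabetical order. Continuing from the Code section above:

```python
# Continuing from the Code section above
print(vectorizer.get_feature_names_out())
# ['amazing' 'bad' 'ever' 'experience' 'is' 'love' 'this' 'worst']
print(X_new.toarray())
# Row 1 ('I hate this'): only 'this' is in the vocabulary
# Row 2 ('What a great day'): no known words, so the row is all zeros
```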
Why This Works
Step 1: Text to numbers
We use CountVectorizer to turn words into numbers so the model can understand text.
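To see this concretely, here is a tiny standalone example with two illustrative sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good movie', 'bad movie']
vec = CountVectorizer()
counts = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # ['bad' 'good' 'movie']
print(counts.toarray())             # [[0 1 1]
                                    #  [1 0 1]]
```

Each row represents one text, and each column counts how often one vocabulary word appears in it.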
Step 2: Train model
The LogisticRegression model learns patterns from labeled examples to distinguish positive and negative sentiment.
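You can inspect what the model learned: each vocabulary word gets a weight, and positive weights push predictions toward the positive class. A short check, continuing from the Code section (the exact numbers depend on the solver and will vary between runs and versions):

```python
# Pair each vocabulary word with its learned weight.
for word, weight in zip(vectorizer.get_feature_names_out(), model.coef_[0]):
    print(f'{word}: {weight:+.3f}')
```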
Step 3: Predict sentiment
The model uses learned patterns to predict if new texts are positive or negative.
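When you need a confidence score rather than a hard 0/1 label, LogisticRegression also provides predict_proba; continuing from the Code section:

```python
# Column 1 holds P(positive) for each text.
probabilities = model.predict_proba(X_new)
for text, p in zip(new_texts, probabilities[:, 1]):
    print(f'{text}: P(positive) = {p:.2f}')
```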
Alternative Approaches
```python
# Alternative 1: TF-IDF weighting instead of raw word counts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['I love this', 'This is bad', 'Amazing experience', 'Worst ever']
labels = [1, 0, 1, 0]

# TfidfVectorizer downweights words that appear in many texts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

new_texts = ['I hate this', 'What a great day']
X_new = vectorizer.transform(new_texts)
predictions = model.predict(X_new)

for text, pred in zip(new_texts, predictions):
    sentiment = 'Positive' if pred == 1 else 'Negative'
    print(f'Text: "{text}" -> Sentiment: {sentiment}')
```
```python
# Alternative 2: Multinomial Naive Bayes on word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['I love this', 'This is bad', 'Amazing experience', 'Worst ever']
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# MultinomialNB is a fast probabilistic classifier suited to count features
model = MultinomialNB()
model.fit(X, labels)

new_texts = ['I hate this', 'What a great day']
X_new = vectorizer.transform(new_texts)
predictions = model.predict(X_new)

for text, pred in zip(new_texts, predictions):
    sentiment = 'Positive' if pred == 1 else 'Negative'
    print(f'Text: "{text}" -> Sentiment: {sentiment}')
```
Complexity: O(n*m) time, O(n*m) space
Time Complexity
Vectorizing n texts over a vocabulary of m unique words takes O(n*m) time, and each pass of Logistic Regression training also costs roughly O(n*m), so total training time grows with the number of iterations.
Space Complexity
Storing the vectorized matrix requires O(n*m) space in the worst case, where n is the number of texts and m is the vocabulary size; in practice sklearn returns a sparse matrix, so memory grows with the number of nonzero counts.
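A quick way to see this, continuing from the Code section:

```python
# fit_transform returns a scipy sparse matrix: only nonzero counts are stored.
print(X.shape)  # (4, 8): 4 texts, 8 vocabulary words
print(X.nnz)    # number of nonzero entries actually stored
```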
Which Approach is Fastest?
Naive Bayes trains faster than Logistic Regression but may be less accurate; TfidfVectorizer adds slight preprocessing overhead but often improves results. To compare speeds on your own data, see the timing sketch after the table below.
| Approach | Time | Space | Best For |
|---|---|---|---|
| CountVectorizer + LogisticRegression | O(n*m) | O(n*m) | Balanced accuracy and speed |
| TfidfVectorizer + LogisticRegression | O(n*m) | O(n*m) | Better accuracy with more computation |
| CountVectorizer + MultinomialNB | O(n*m) | O(n*m) | Fast training on small datasets |
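As a rough way to compare training speed, you can time each combination directly; here is a minimal sketch using the example texts (timings on four sentences are not meaningful, so substitute a real corpus):

```python
import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

texts = ['I love this', 'This is bad', 'Amazing experience', 'Worst ever']
labels = [1, 0, 1, 0]

# Time the fit() call for each vectorizer/classifier pair.
for vec, clf in [(CountVectorizer(), LogisticRegression(max_iter=1000)),
                 (TfidfVectorizer(), LogisticRegression(max_iter=1000)),
                 (CountVectorizer(), MultinomialNB())]:
    X = vec.fit_transform(texts)
    start = time.perf_counter()
    clf.fit(X, labels)
    elapsed = time.perf_counter() - start
    print(f'{type(vec).__name__} + {type(clf).__name__}: {elapsed:.4f}s')
```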