
Text classification pipeline in ML Python

Introduction

A text classification pipeline teaches a computer to sort text into groups. This makes it possible to organize or summarize large amounts of text quickly. Typical uses include:

Sorting emails into spam or not spam
Classifying customer reviews as positive or negative
Organizing news articles by topic like sports or politics
Filtering social media posts by sentiment
Detecting language of a text automatically
Syntax
ML Python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

The pipeline chains steps: first it changes text into numbers, then it learns to classify.

Each step has a name and a model or transformer.
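The step names are more than labels: they can be used to look a step up later. A minimal sketch, reusing the same pipeline as above, that lists the step names and retrieves one step through `named_steps`:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

# Each step is a (name, estimator) pair; collect just the names
step_names = [name for name, _ in pipeline.steps]
print(step_names)  # ['vectorizer', 'classifier']

# named_steps looks a step up by the name it was given
vectorizer = pipeline.named_steps['vectorizer']
print(type(vectorizer).__name__)  # TfidfVectorizer
```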

Examples
This example removes common English words (stop words) and raises the solver's iteration limit so that training has more room to converge.
ML Python
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000))
])
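To see what `stop_words='english'` changes, one option is to compare the vocabulary learned with and without it. The sentences below are made up for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample sentences, used only to build a vocabulary
docs = ['the product is great', 'the delivery was slow']

plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words='english').fit(docs)

# vocabulary_ maps each kept word to its column index
print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))

# Common words such as 'the' are dropped; content words such as 'product' remain
print('the' in plain.vocabulary_, 'the' in filtered.vocabulary_)  # True False
```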
This uses a simple word count and a Naive Bayes classifier, good for quick text sorting.
ML Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
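A short sketch of how this Naive Bayes pipeline might be trained and used. The four messages and their `'spam'` / `'ham'` labels are invented for illustration, and a real model would need far more data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: two spam-like and two normal messages
messages = [
    'free prize now',
    'win money free',
    'meeting at noon',
    'lunch at noon tomorrow',
]
message_labels = ['spam', 'spam', 'ham', 'ham']

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# fit() runs the vectorizer, then trains the classifier on the word counts
nb_pipeline.fit(messages, message_labels)

# predict() takes raw text; the pipeline vectorizes it before classifying
prediction = nb_pipeline.predict(['free money'])
print(prediction[0])  # spam
```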
Sample Model

This program trains a text classifier to tell positive from negative reviews. It shows predictions and accuracy on test data.

ML Python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample text data and labels
texts = [
    'I love this product',
    'This is the worst thing ever',
    'Absolutely fantastic experience',
    'I hate it',
    'Not bad, could be better',
    'I am very happy with this',
    'Terrible, do not buy',
    'Best purchase I made',
    'Awful, waste of money',
    'Pretty good overall'
]
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # 1=positive, 0=negative

# Split data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict on test data
predictions = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f'Predictions: {predictions}')
print(f'Accuracy: {accuracy:.2f}')
Important Notes

Always split your data into training and testing sets to check how well your model works on new text.
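With a small dataset, a random split can leave one class under-represented in the test set. One possible refinement, not used in the sample program above, is the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both parts. The texts here are placeholders:

```python
from sklearn.model_selection import train_test_split

# Placeholder texts with a balanced set of labels (5 positive, 5 negative)
sample_texts = [f'review {i}' for i in range(10)]
sample_labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

# stratify=sample_labels asks for similar class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    sample_texts, sample_labels,
    test_size=0.3, random_state=42, stratify=sample_labels
)

print(len(X_tr), len(X_te))  # 7 3
```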

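TfidfVectorizer turns text into numbers: it counts how often each word appears in a document, then gives less weight to words that appear in many documents, so distinctive words count more.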
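A small sketch of that weighting on three made-up sentences. The word "the" appears in every sentence, while "cat" appears in only one, so "cat" ends up with the larger weight:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences chosen so that 'the' is common and 'cat' is rare
docs = ['the cat sat', 'the dog sat', 'the bird flew']

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

# One row per document, one column per distinct word
print(matrix.shape)  # (3, 6)

# Look at the weights in the first document ('the cat sat')
first_row = matrix.toarray()[0]
weight_the = first_row[tfidf.vocabulary_['the']]
weight_cat = first_row[tfidf.vocabulary_['cat']]
print(weight_cat > weight_the)  # True
```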

LogisticRegression is a simple but effective model for two-class text classification, and it also handles more than two classes.
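Besides a predicted label, a LogisticRegression pipeline can report a probability for each class through `predict_proba`. A minimal sketch on four of the sample reviews (too few for a meaningful model, so the numbers only show the shape of the output):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    'I love this product',
    'I hate it',
    'Best purchase I made',
    'Terrible, do not buy',
]
train_labels = [1, 0, 1, 0]  # 1=positive, 0=negative

proba_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])
proba_pipeline.fit(train_texts, train_labels)

# One row per input text, one column per class; each row sums to 1
proba = proba_pipeline.predict_proba(['I love it'])
classes = proba_pipeline.named_steps['classifier'].classes_

print(classes)      # column order of the probabilities
print(proba.shape)  # (1, 2)
```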

Summary

A text classification pipeline turns raw text into numbers and then learns to sort it.

It helps automate sorting tasks like spam detection or sentiment analysis.

Using pipelines keeps your code clean and easy to manage.
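One way pipelines keep code manageable is that settings of any step can be changed from a single place, using the `stepname__parameter` naming pattern. A sketch, where the values `'english'` and `0.5` are arbitrary choices for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# '<step name>__<parameter>' reaches into a step without rebuilding the pipeline
pipeline.set_params(vectorizer__stop_words='english', classifier__C=0.5)

print(pipeline.named_steps['vectorizer'].stop_words)  # english
print(pipeline.named_steps['classifier'].C)           # 0.5
```

The same naming pattern is what tools such as `GridSearchCV` use when tuning a pipeline's parameters.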