What is a text classification pipeline in ML Python?

Introduction

We use a text classification pipeline to teach a computer to sort text into groups (classes). This lets us organize or analyze large amounts of text automatically instead of reading it all by hand.

Typical uses include:

Sorting emails into spam or not spam

Classifying customer reviews as positive or negative

Organizing news articles by topic like sports or politics

Filtering social media posts by sentiment

Detecting the language of a text automatically

Syntax

ML Python

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

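The pipeline chains steps in order: the vectorizer first converts text into numeric features, then the classifier learns to assign labels from those features.

Each step is a (name, object) pair. Every step except the last must be a transformer; the last step is usually the model that makes predictions.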
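
The step names are more than labels: they let you look up a step and change its settings later. The following is a minimal sketch of that idea, reusing the pipeline from the Syntax section; the value 500 for max_iter is an arbitrary choice made only for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

# Look up a step by the name it was given in the pipeline
vectorizer = pipeline.named_steps['vectorizer']
print(type(vectorizer).__name__)

# Change a step's setting with the '<step name>__<parameter>' convention
# (500 is an arbitrary illustrative value)
pipeline.set_params(classifier__max_iter=500)
print(pipeline.named_steps['classifier'].max_iter)

# List the step names in the order they run
step_names = [name for name, _ in pipeline.steps]
print(step_names)
```

The same 'step__parameter' names are what tools such as grid search expect when they tune a pipeline.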

Examples

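This example removes common English words (stop words such as "the" and "is") and raises the solver's iteration limit from the default of 100 to 1000, which gives training more room to converge.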

ML Python

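from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000))
])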

This uses raw word counts and a Naive Bayes classifier, a fast and simple baseline for sorting text.

ML Python

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

Sample Model

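This program trains a text classifier to tell positive from negative reviews, then prints its predictions and accuracy on held-out test data. With 10 examples and test_size=0.3, only three reviews are used for testing, so treat the accuracy as a rough illustration and not a reliable measurement.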

ML Python

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample text data and labels
texts = [
    'I love this product',
    'This is the worst thing ever',
    'Absolutely fantastic experience',
    'I hate it',
    'Not bad, could be better',
    'I am very happy with this',
    'Terrible, do not buy',
    'Best purchase I made',
    'Awful, waste of money',
    'Pretty good overall'
]
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # 1=positive, 0=negative

# Split data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict on test data
predictions = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f'Predictions: {predictions}')
print(f'Accuracy: {accuracy:.2f}')
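
Once a pipeline is trained, you can pass it new raw text directly. The sketch below is self-contained and, as an assumption made for brevity, trains on a handful of the sample reviews without a test split; the two new reviews are made up for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A few of the sample reviews (1=positive, 0=negative)
texts = [
    'I love this product',
    'This is the worst thing ever',
    'Absolutely fantastic experience',
    'I hate it',
    'Best purchase I made',
    'Awful, waste of money'
]
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])
pipeline.fit(texts, labels)

# Made-up reviews the model has not seen
new_reviews = ['I love it', 'Terrible waste of money']
predicted = pipeline.predict(new_reviews)
probabilities = pipeline.predict_proba(new_reviews)  # columns: class 0, class 1

for review, label, probs in zip(new_reviews, predicted, probabilities):
    print(f'{review!r} -> label {label}, '
          f'P(negative)={probs[0]:.2f}, P(positive)={probs[1]:.2f}')
```

The pipeline applies the same vectorizer that was fitted during training, so there is no separate transform step to remember.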

Important Notes

Always split your data into training and testing sets to check how well your model works on new text.

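TfidfVectorizer turns text into numbers by counting how often each word appears in a text and down-weighting words that appear in many texts, so rarer, more distinctive words count for more.

LogisticRegression is a simple but effective model for two-class text classification, and it also handles more than two classes.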

Summary

A text classification pipeline turns raw text into numbers and then learns to sort it.

It helps automate sorting tasks like spam detection or sentiment analysis.

Using pipelines keeps your code clean and easy to manage.
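
One way to see why pipelines keep code clean is to write the same steps by hand and compare. The sketch below does both on a few made-up reviews (the data is an assumption for illustration) and checks that the predictions match.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ['I love this product', 'I hate it',
               'Best purchase I made', 'Awful, waste of money']
train_labels = [1, 0, 1, 0]  # 1=positive, 0=negative
new_texts = ['I love it', 'Awful product']

# By hand: fit and apply each step yourself, in the right order
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, train_labels)
manual_predictions = classifier.predict(vectorizer.transform(new_texts))

# With a pipeline: one object, one fit call, one predict call
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])
pipeline.fit(train_texts, train_labels)
pipeline_predictions = pipeline.predict(new_texts)

same = list(manual_predictions) == list(pipeline_predictions)
print('Same predictions:', same)
```

Both versions do the same work, but the pipeline version cannot accidentally skip the transform step or fit the vectorizer on the wrong data.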