We use a text classification pipeline to teach a computer how to sort text into groups. This helps us quickly understand or organize lots of text.
Text classification pipeline in ML Python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])
The pipeline chains steps: first it changes text into numbers, then it learns to classify.
Each step is a (name, object) pair. Every step except the last must be a transformer, and the last step is usually the model.
# TF-IDF features with logistic regression
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Word-count features with Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
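Because every step has a name, that name can be used to look a step up or to change one of its parameters after the pipeline is built. The following is a minimal sketch of that idea, reusing the step names 'vectorizer' and 'classifier' from the examples above; the value 1000 for max_iter is only illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression()),
])

# Look up a step by the name it was given.
vectorizer = pipeline.named_steps['vectorizer']

# Change a step's parameter with the '<step name>__<parameter>' convention.
pipeline.set_params(classifier__max_iter=1000)

# List the step names in the order they run.
step_names = [name for name, _ in pipeline.steps]
print(step_names)
print(pipeline.named_steps['classifier'].max_iter)
```

The same '<step name>__<parameter>' form is what parameter search tools in scikit-learn expect, so choosing clear step names pays off later.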
This program trains a text classifier to tell positive from negative reviews. It shows predictions and accuracy on test data.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample text data and labels
texts = [
    'I love this product',
    'This is the worst thing ever',
    'Absolutely fantastic experience',
    'I hate it',
    'Not bad, could be better',
    'I am very happy with this',
    'Terrible, do not buy',
    'Best purchase I made',
    'Awful, waste of money',
    'Pretty good overall'
]
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # 1=positive, 0=negative

# Split data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict on test data
predictions = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f'Predictions: {predictions}')
print(f'Accuracy: {accuracy:.2f}')
Always split your data into training and testing sets to check how well your model works on new text.
TfidfVectorizer turns text into numbers by counting how often each word appears in a document and giving less weight to words that appear in many documents.
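To see the weighting in practice, here is a small sketch on a made-up three-sentence corpus (the sentences are hypothetical, chosen so that 'the' occurs in every document). It reads the learned inverse-document-frequency weights from the fitted vectorizer's idf_ attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'the' appears in every document.
docs = ['the cat sat', 'the dog barked', 'the bird sang']

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Map each word to its learned inverse-document-frequency weight.
idf = {word: vectorizer.idf_[index]
       for word, index in vectorizer.vocabulary_.items()}

print(matrix.shape)  # (number of documents, number of distinct words)
print(idf['the'], idf['cat'])
```

With the default settings, a word found in every document gets the smallest weight, so 'the' counts for less than 'cat' when the classifier looks at the features.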
LogisticRegression is a simple but effective model for two-class text classification.
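Besides a predicted label, a logistic regression classifier can report how confident it is in each class. The sketch below assumes a tiny hand-made training set (hypothetical data, far too small for real use) and calls predict_proba on the fitted pipeline.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = [
    'I love this product',
    'Absolutely fantastic experience',
    'I hate it',
    'Terrible, do not buy',
]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

# One row per input text, one column per class, in the order of classes_.
probabilities = pipeline.predict_proba(['I love it'])
classes = list(pipeline.named_steps['classifier'].classes_)

print(classes)
print(probabilities)
```

Each row of probabilities sums to 1, so the two columns can be read as the model's estimated chance of a negative and a positive review.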
A text classification pipeline turns raw text into numbers and then learns to sort it.
It helps automate sorting tasks like spam detection or sentiment analysis.
Using pipelines keeps your code clean and easy to manage.
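To illustrate the last point, this sketch does the same job twice: once by fitting and applying each step by hand, and once with a pipeline. The training and test sentences are hypothetical examples, and both versions are expected to give the same predictions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 1 = positive, 0 = negative.
train_texts = ['I love this product', 'Best purchase I made',
               'I hate it', 'Terrible, do not buy']
train_labels = [1, 1, 0, 0]
new_texts = ['I love it', 'do not buy this']

# Manual version: each step is fitted and applied by hand, in the right order.
vectorizer = TfidfVectorizer()
classifier = LogisticRegression(max_iter=1000)
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)
manual_predictions = classifier.predict(vectorizer.transform(new_texts))

# Pipeline version: one fit call and one predict call cover both steps.
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_texts, train_labels)
pipeline_predictions = pipeline.predict(new_texts)

print(list(manual_predictions) == list(pipeline_predictions))
```

The pipeline version also removes a common mistake: calling fit_transform instead of transform on new text, which would relearn the vocabulary from the test data.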