
How to Create a Pipeline in sklearn with Python - Simple Guide

To create a pipeline in sklearn, use the Pipeline class by passing a list of steps as tuples with a name and a transformer or estimator. This lets you chain preprocessing and modeling steps into one object for easy training and prediction.

Syntax

The Pipeline class in sklearn.pipeline takes a list of steps, where each step is a tuple of a name and a transformer or estimator. The last step is usually an estimator (the model); every earlier step must be a transformer (such as a scaler or feature selector).

Example syntax:

  • Pipeline([('step_name1', transformer1), ('step_name2', estimator)])

Each step name is a string identifier, and the transformer or estimator is an object implementing fit and transform (for transformers) or fit and predict (for estimators).

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
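
To make the chaining concrete, the sketch below fits the same scaler and model twice on the Iris data: once through a Pipeline and once by hand. The variable names (manual_scaler, manual_model, same_predictions) and the max_iter=200 setting are illustrative assumptions, not part of the syntax itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipeline version: one object handles scaling and modeling.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])
pipeline.fit(X, y)
pipeline_preds = pipeline.predict(X)

# Manual version: the same two steps written out by hand.
manual_scaler = StandardScaler()
X_scaled = manual_scaler.fit_transform(X)
manual_model = LogisticRegression(max_iter=200)
manual_model.fit(X_scaled, y)
manual_preds = manual_model.predict(manual_scaler.transform(X))

# Both routes apply the same operations, so the predictions should agree.
same_predictions = np.array_equal(pipeline_preds, manual_preds)
print(same_predictions)
```

The pipeline version is shorter and removes the risk of forgetting to scale the data before calling predict.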

Example

This example shows how to create a pipeline that scales features and then fits a logistic regression model on the Iris dataset. It trains the pipeline and prints the accuracy on test data.

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict and evaluate
predictions = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')
Output
Accuracy: 1.00

Common Pitfalls

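1. Reusing a step name: Each step name must be a unique string; otherwise sklearn raises a ValueError.

2. Putting an estimator before a transformer: Every step except the last must be a transformer (it must implement fit and transform), so the model goes last. Otherwise sklearn raises a TypeError.

3. Not using the pipeline for both training and prediction: Always call fit and predict on the pipeline object itself, so that every step runs in order and the preprocessing fitted on the training data is reused at prediction time. Note that transform is available on the pipeline only when its last step is a transformer.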

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Wrong: Estimator before transformer
wrong_pipeline = Pipeline([
    ('logreg', LogisticRegression()),
    ('scaler', StandardScaler())
])
# Recent sklearn versions raise a TypeError when wrong_pipeline.fit(...) is called

# Right: Transformer before estimator
right_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
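
The first two pitfalls can be reproduced directly. The sketch below is a minimal illustration on the Iris data. The helper try_fit is a hypothetical name introduced here, and it wraps both construction and fitting because the point at which sklearn validates the steps has differed between versions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)


def try_fit(steps):
    """Build and fit a pipeline; return the exception class name, or None on success."""
    try:
        Pipeline(steps).fit(X, y)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# Pitfall 1: two steps share the name 'scaler'.
duplicate_names = try_fit([
    ('scaler', StandardScaler()),
    ('scaler', MinMaxScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

# Pitfall 2: the model sits before a transformer.
wrong_order = try_fit([
    ('logreg', LogisticRegression(max_iter=200)),
    ('scaler', StandardScaler()),
])

# Correct layout: unique names, transformer first, model last.
valid = try_fit([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

print(duplicate_names, wrong_order, valid)
```

In recent sklearn releases this should report a ValueError for the duplicate names, a TypeError for the misplaced estimator, and None for the valid pipeline.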

Quick Reference

Pipeline Cheat Sheet:

  • Pipeline(steps=[('name', transformer_or_estimator), ...]): Create pipeline
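  • pipeline.fit(X, y): Fit each transformer in order, then fit the last step on the transformed data
  • pipeline.predict(X): Transform X through the earlier steps, then predict with the last estimator
  • pipeline.transform(X): Apply the transform of every step, including the last; available only when the last step is a transformer
  • pipeline[:-1].transform(X): Transform data using all but the last step
  • Use unique step names
  • Every step except the last must be a transformer; the last step is usually the model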
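
A pipeline whose last step is a model such as LogisticRegression does not expose transform, but it can be sliced to get the preprocessing part alone. The following is a small sketch of that behavior, assuming a sklearn version that supports pipeline slicing. The variable names has_transform and X_scaled are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])
pipeline.fit(X, y)

# The last step is a classifier without transform, so the pipeline has none either.
has_transform = hasattr(pipeline, 'transform')

# Slicing off the last step gives a sub-pipeline holding the fitted scaler only.
X_scaled = pipeline[:-1].transform(X)

print(has_transform)
print(X_scaled.shape)
# Standardized training data should have per-feature means close to zero.
print(np.allclose(X_scaled.mean(axis=0), 0.0))
```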

Key Takeaways

Use sklearn.pipeline.Pipeline to chain preprocessing and modeling steps in one object.
Name each pipeline step uniquely and put transformers before the final estimator.
Fit and predict using the pipeline object to apply all steps correctly.
Pipelines keep code clean and help avoid data leakage, because the preprocessing steps are fitted only on the data passed to fit and are merely applied to the test data.
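
The leakage point matters most in cross-validation. As a hedged sketch (the 5-fold setting and the Iris data are assumptions chosen for illustration), passing the whole pipeline to cross_val_score means the scaler is refitted on each training fold rather than on the full dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler never sees the rows held out for scoring.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f'Mean CV accuracy: {scores.mean():.2f}')
```

Scaling X once up front and then cross-validating only the model would let statistics from the held-out rows influence the scaler, which is the leakage the pipeline avoids.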