How to Use Pipeline Class in sklearn with Python

Use the Pipeline class from sklearn.pipeline to chain multiple data processing steps and a model into one object. Define steps as a list of tuples with a name and transformer or estimator, then fit and predict using the pipeline like a single model.

📐

Syntax

The Pipeline class is created by passing a list of steps, where each step is a tuple containing a name (a string) and a transformer or estimator object. Every step except the last must be a transformer, meaning an object with fit() and transform() methods. The last step is usually a model.

Example syntax:

Pipeline(steps=[('name1', transformer1), ('name2', transformer2), ('model', estimator)])

Use fit() to train and predict() to make predictions on new data.

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])

💻

Example

This example shows how to create a pipeline that scales features and then fits a logistic regression model on the Iris dataset. It demonstrates training and prediction with the pipeline.

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Create pipeline
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Output

Accuracy: 1.00
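
To make concrete what the pipeline does when used "like a single model", here is a minimal sketch that performs the same two steps by hand and compares the predictions. The variable names (manual_pred, pipeline_pred, same_predictions) are illustrative only, and the comparison assumes the same data split and model settings as the example above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Manual version: fit the scaler on the training data only,
# then reuse the fitted scaler to transform the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
manual_pred = model.predict(X_test_scaled)

# Pipeline version: the same two steps wrapped in one object.
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# Both approaches apply identical operations, so the predictions should match.
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print(same_predictions)
```

The pipeline version is shorter and removes the risk of accidentally fitting the scaler on the test data, since predict() only ever calls transform() on the earlier steps.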

⚠️

Common Pitfalls

Not naming steps uniquely: Each step name must be a unique string.
Placing the model before transformers: The model should be the last step, because every earlier step must implement transform().
Calling fit_transform() on the whole pipeline when the last step is an estimator that has neither transform() nor fit_transform().
Forgetting to call fit() before predict().

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

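# Wrong: model before scaler. In current scikit-learn versions, calling fit()
# on this pipeline raises TypeError, since LogisticRegression has no transform().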
wrong_pipeline = Pipeline(steps=[
    ('logreg', LogisticRegression()),
    ('scaler', StandardScaler())
])

# Right order
right_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
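
The pitfalls listed above each surface as a specific exception. The following sketch triggers them one by one and records the exception type. It is based on the behavior of recent scikit-learn releases, where step validation happens inside fit(); the errors dictionary and its keys are illustrative names, and exact messages may differ between versions.

```python
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
errors = {}  # pitfall name -> name of the exception type that was raised

# Pitfall 1: duplicate step names.
try:
    Pipeline(steps=[
        ('step', StandardScaler()),
        ('step', LogisticRegression(max_iter=200))
    ]).fit(X, y)
except ValueError as exc:
    errors['duplicate_names'] = type(exc).__name__

# Pitfall 2: model placed before a transformer.
try:
    Pipeline(steps=[
        ('logreg', LogisticRegression(max_iter=200)),
        ('scaler', StandardScaler())
    ]).fit(X, y)
except TypeError as exc:
    errors['wrong_order'] = type(exc).__name__

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Pitfall 3: fit_transform() when the last step cannot transform.
try:
    pipeline.fit_transform(X, y)
except AttributeError as exc:
    errors['fit_transform'] = type(exc).__name__

# Pitfall 4: predict() on a pipeline that was never fitted.
# A fresh pipeline is used because pitfall 3 may have fitted the scaler.
unfitted = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])
try:
    unfitted.predict(X)
except NotFittedError as exc:
    errors['predict_before_fit'] = type(exc).__name__

print(errors)
```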

📊

Quick Reference

Method	Description
fit(X, y)	Train all steps in the pipeline on data X and labels y
predict(X)	Make predictions using the final estimator on data X
fit_transform(X, y=None)	Fit all steps and return the transformed data; available only when the final step is itself a transformer
named_steps	Attribute (not a method) for accessing steps by name, e.g., pipeline.named_steps['scaler']
set_params(**params)	Set parameters of steps using step__param syntax
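
As a short illustration of the last two rows, the sketch below sets a parameter on one step with the step__param syntax and then inspects the fitted steps through named_steps. The value C=0.5 is an arbitrary choice for demonstration, not a recommended setting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# step__param syntax: "<step name>__<parameter name>".
# Here it sets the C parameter of the step named 'logreg'.
pipeline.set_params(logreg__C=0.5)
pipeline.fit(X, y)

# named_steps gives access to the fitted objects by step name.
scaler = pipeline.named_steps['scaler']
logreg = pipeline.named_steps['logreg']

print(scaler.mean_)   # per-feature means learned by the scaler during fit
print(logreg.C)       # the value set through set_params
```

The same step__param names are what scikit-learn's parameter search tools expect when tuning a pipeline, which is one reason to choose short, descriptive step names.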

✅

Key Takeaways

Use sklearn's Pipeline to chain preprocessing and modeling steps into one object.

Always put transformers before the final estimator in the steps list.

Call fit() on the pipeline to train all steps together.

Use unique names for each step to avoid errors.

Access individual steps via pipeline.named_steps for inspection or tuning.