
How to Use Pipeline in sklearn for Clean ML Workflows

Use Pipeline in sklearn to chain steps such as data preprocessing and model training into a single object. Calling fit or predict on that object runs every step in order, which keeps your code cleaner and less error-prone.

📐

Syntax

The Pipeline is created by passing a list of named steps, where each step is a tuple with a name and a transformer or estimator. The last step is usually a model. You call fit to train all steps and predict to get predictions.

steps: List of tuples like ('name', transformer/estimator)
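fit(): Fits and applies each transformer in order, then fits the final estimator on the transformed data
predict(): Transforms the data with each fitted transformer, then returns the final estimator's predictions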

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Fit the pipeline on training data
pipeline.fit(X_train, y_train)

# Predict on new data
predictions = pipeline.predict(X_test)
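
To make the behavior of fit and predict concrete, the sketch below compares a pipeline with the equivalent manual steps. The synthetic data and variable names are made up for illustration; the two sets of predictions are expected to agree because both versions perform the same operations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (made up for illustration)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(20, 3))

# Manual version: fit the scaler on training data, then reuse it on test data
scaler = StandardScaler()
model = LogisticRegression()
X_train_scaled = scaler.fit_transform(X_train)
model.fit(X_train_scaled, y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Pipeline version: the same steps behind a single fit and predict
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# Check whether both approaches give the same predictions
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print(same_predictions)
```

The pipeline version removes the risk of forgetting to scale the test data, or of scaling it with a different scaler than the one used in training.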

💻

Example

This example shows how to use Pipeline to scale features and train a logistic regression model on the iris dataset. It fits the pipeline and prints the accuracy on test data.

python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Output

Accuracy: 1.00
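
Step names are more than labels. As a small sketch (the value C=0.5 is an arbitrary choice for illustration), a name lets you set a parameter on a step with set_params using the 'stepname__parameter' form, and look up the fitted step afterwards through named_steps.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

# Set a hyperparameter on a step using '<step name>__<parameter>'
pipeline.set_params(logreg__C=0.5)  # 0.5 is an arbitrary illustrative value
pipeline.fit(X, y)

# Look up the fitted steps by name
scaler = pipeline.named_steps['scaler']
logreg = pipeline.named_steps['logreg']

print(scaler.mean_)  # per-feature means learned during fit
print(logreg.C)      # the value set through set_params
```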

⚠️

Common Pitfalls

Common mistakes when using Pipeline include:

Not naming steps uniquely, which raises a ValueError.
Trying to use transform on a pipeline that ends with a model that does not support it.
Forgetting to call fit before predict.
Doing preprocessing outside the pipeline, so that data passed to predict is not prepared the same way as the training data.

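Every step except the last must be a transformer with fit and transform methods. The last step only needs fit, but it must also implement predict if you want to call predict on the pipeline.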

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Wrong: duplicate step names
# pipeline = Pipeline([
#     ('scaler', StandardScaler()),
#     ('scaler', StandardScaler()),  # Error: duplicate name
#     ('model', LogisticRegression())
# ])

# Right: unique step names
pipeline = Pipeline([
    ('scaler1', StandardScaler()),
    ('scaler2', StandardScaler()),
    ('model', LogisticRegression())
])
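
The pitfalls listed above can be reproduced in a few lines. This is a minimal sketch on a tiny made-up dataset; the duplicate-name check is wrapped around both construction and fit because the point at which the error is raised may depend on the scikit-learn version.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

# Pitfall: duplicate step names raise a ValueError
duplicate_error = None
try:
    bad = Pipeline([
        ('scaler', StandardScaler()),
        ('scaler', StandardScaler()),
        ('model', LogisticRegression()),
    ])
    bad.fit(X, y)
except ValueError as exc:
    duplicate_error = exc
print('Duplicate names:', duplicate_error)

# Pitfall: calling predict before fit raises NotFittedError
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
not_fitted_error = None
try:
    pipeline.predict(X)
except NotFittedError as exc:
    not_fitted_error = exc
print('Predict before fit:', type(not_fitted_error).__name__)

# Pitfall: a pipeline ending in a classifier does not expose transform
pipeline.fit(X, y)
can_transform = hasattr(pipeline, 'transform')
print('Has transform:', can_transform)
```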

📊

Quick Reference

Remember these tips when using Pipeline:

Each step is a tuple: ('name', transformer/estimator).
Every step before the last must be a transformer; the last step must implement fit, plus predict if you want predictions.
Use fit once to train all steps.
Use predict to get predictions after fitting.
Pipeline helps avoid data leakage because preprocessing is fit inside the pipeline, on the training data only.
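
The leakage point matters most during cross-validation. As a sketch, assuming the same iris setup as the example earlier, passing the whole pipeline to cross_val_score means the scaler is refit on each fold's training portion rather than on the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

# For each fold, cross_val_score clones the pipeline and fits it on that
# fold's training portion only, so the scaler never sees the held-out rows.
scores = cross_val_score(pipeline, X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```

Scaling X once up front and then cross-validating only the model would let statistics from the held-out rows influence training, which is the leak the pipeline avoids.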

✅

Key Takeaways

Use sklearn's Pipeline to chain preprocessing and modeling steps into one object.

Name each step uniquely, and end the pipeline with a model that implements fit and predict if you want predictions.

Call fit on the pipeline to train all steps together, then use predict for results.

Pipeline helps prevent data leakage: transformers learn their parameters only from the training data passed to fit, and predict applies those learned parameters to new data.

Avoid calling transform on pipelines ending with models that do not support it.