MLOps Concept · Beginner · 3 min read

What Is a Pipeline in sklearn: Simple Explanation and Example

In sklearn, a Pipeline is a tool that chains multiple data processing steps and a model into one sequence. It helps automate workflows by applying transformations and training or predicting in a single call.
⚙️

How It Works

A Pipeline in sklearn works like a factory assembly line. Imagine you want to prepare a sandwich: first you slice the bread, then add fillings, and finally wrap it. Each step must happen in order to get the final sandwich.

Similarly, a pipeline chains steps like data cleaning, feature scaling, and model training. When you give data to the pipeline, it passes through each step one by one automatically. This keeps your code clean and avoids mistakes like forgetting to apply the same data transformations during testing.
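To see what the pipeline automates, here is a rough sketch of the same workflow done by hand (using the iris data from the example below). Notice that you must remember to call `fit_transform` on training data but only `transform` on test data; a pipeline handles this bookkeeping for you:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Without a pipeline, each step runs manually:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training data
model = LogisticRegression(max_iter=200).fit(X_train_scaled, y_train)

# At test time, reuse the *fitted* scaler -- forgetting this is a classic bug
X_test_scaled = scaler.transform(X_test)         # apply the same statistics
manual_preds = model.predict(X_test_scaled)
```

A pipeline wraps these calls into one object, so `fit` and `predict` always run the steps in the same order with the same fitted state.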

💻

Example

This example shows a pipeline that scales data and then fits a logistic regression model. It trains on sample data and predicts new values.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create pipeline with scaler and model
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train pipeline
pipe.fit(X_train, y_train)

# Predict on test data
predictions = pipe.predict(X_test)

# Show first 5 predictions
print(predictions[:5])
```

Output:

```
[1 0 2 1 1]
```
🎯

When to Use

Use a Pipeline when you have multiple steps to prepare data and train a model. It ensures all steps run in the right order and the same way during training and testing.

For example, if you clean and vectorize text data before classification, a pipeline keeps those steps together as one object. It also helps when tuning hyperparameters, because you can search over the whole process at once.
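As a sketch of that tuning workflow: sklearn lets you address a step's parameter in a grid search as `step_name__parameter_name`. The grid values below are illustrative, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Parameters are addressed as <step_name>__<parameter_name>,
# so 'logreg__C' tunes C inside the LogisticRegression step.
param_grid = {'logreg__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```

Because the scaler is refit inside each cross-validation fold, this also avoids leaking test-fold statistics into training.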

Key Points

  • Pipeline chains data transformations and model training into one object.
  • It helps avoid mistakes by applying the same steps during training and prediction.
  • Supports easy parameter tuning across all steps.
  • Makes code cleaner and easier to maintain.

Key Takeaways

An sklearn Pipeline chains multiple steps like scaling and modeling into one sequence.
It ensures consistent data processing during training and prediction.
Pipelines simplify code and reduce errors in machine learning workflows.
They allow easy tuning of parameters across all steps together.