How to Create an ML Pipeline: Step-by-Step Guide
To create an ML pipeline, you connect steps such as data preprocessing, model training, and evaluation into a sequence that automates the workflow. Tools such as scikit-learn's Pipeline class let you organize these steps cleanly and run them together.
Syntax
An ML pipeline typically chains multiple steps: data cleaning, feature extraction, model training, and prediction. In Python's scikit-learn, you use the Pipeline class to define this sequence.
Each step has a name and a transformer or estimator object. The last step is usually the model.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
```
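After fitting, each named step stays accessible through the pipeline's `named_steps` attribute, which is useful for inspecting learned parameters. A minimal sketch (fitting on the full iris dataset purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200))
])
pipeline.fit(X, y)

# Each fitted step is reachable by the name given in the step list
print(pipeline.named_steps['scaler'].mean_.round(2))   # per-feature means learned by the scaler
print(pipeline.named_steps['model'].coef_.shape)       # (3, 4): 3 classes, 4 features
```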
Example
This example shows a full ML pipeline that scales data, trains a logistic regression model, and evaluates accuracy on test data.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```
Output
Accuracy: 1.00
Common Pitfalls
- Not including preprocessing steps in the pipeline, which can cause data leakage if preprocessing is done before splitting data.
- Forgetting to fit the pipeline before predicting.
- Using incompatible transformers or models that do not follow the expected interface.
- Not setting random states for reproducibility.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Wrong: fitting the scaler separately before building the pipeline
scaler = StandardScaler()
scaler.fit(X_train)  # data leakage risk if done before the train/test split
pipeline = Pipeline([
    ('scaler', scaler),
    ('model', LogisticRegression(random_state=42))
])

# Right: include the scaler inside the pipeline and fit once
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
```
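Keeping preprocessing inside the pipeline pays off most during cross-validation: the whole pipeline is refit on each training fold, so the scaler never sees the held-out fold. A hedged sketch of this pattern with `cross_val_score`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42, max_iter=200))
])

# cross_val_score clones and refits the entire pipeline on each training fold,
# so the scaler's statistics are never computed from validation data
scores = cross_val_score(pipeline, X, y, cv=5)
print(f'Mean CV accuracy: {scores.mean():.2f}')
```

If you had fit the scaler on the full dataset first, every fold's validation data would leak into the scaling statistics and the scores could be optimistically biased.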
Quick Reference
Remember these key points when creating ML pipelines:
- Use Pipeline to chain preprocessing and modeling steps.
- Fit the pipeline on training data only.
- Use pipeline.predict() to get predictions.
- Set random states for reproducibility.
- Keep pipeline steps simple and modular.
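Step names also matter for hyperparameter tuning: scikit-learn addresses pipeline parameters with the `<step name>__<parameter>` convention, so a pipeline plugs straight into GridSearchCV. A minimal sketch, reusing the step names from the examples above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42, max_iter=200))
])

# 'model__C' targets the C parameter of the step named 'model'
param_grid = {'model__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```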
Key Takeaways
- Use a pipeline to automate and organize ML workflow steps cleanly.
- Include all preprocessing inside the pipeline to avoid data leakage.
- Fit the pipeline only on training data before predicting.
- Use pipeline methods like fit and predict for consistent results.
- Set random states to make your experiments reproducible.
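When you don't need custom step names, scikit-learn's make_pipeline helper builds the same object with auto-generated names (the lowercased class names). A short sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline derives step names from the estimator class names
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
print([name for name, _ in pipeline.steps])
# → ['standardscaler', 'logisticregression']
```

The generated names are what you would use in grid-search parameter keys, e.g. 'logisticregression__C'.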