Create End-to-End ML Pipeline in Python with sklearn
To create an end-to-end ML pipeline in Python, use sklearn.pipeline.Pipeline to chain preprocessing steps and a model into a single object. The whole chain can then be trained and applied in a clean, repeatable way through one fit() and predict() interface.
Syntax
An ML pipeline in sklearn is created with Pipeline, which takes a list of steps. Each step is a tuple of a name and a transformer or estimator. The last step must be an estimator (the model). Use fit() to train and predict() to get predictions.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),    # Data scaling step
    ('model', LogisticRegression())  # Model training step
])
```
Example
This example shows a full pipeline that scales data, trains a logistic regression model on the iris dataset, and evaluates accuracy.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```
Output
Accuracy: 1.00
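Because the pipeline is one object, its fitted steps remain inspectable after training. A minimal sketch (reusing the same iris setup as above) that reads the learned scaler statistics and model coefficients via named_steps:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])
pipeline.fit(X_train, y_train)

# Each fitted step is reachable by the name given in the steps list
print(pipeline.named_steps['scaler'].mean_)        # per-feature means, learned from X_train only
print(pipeline.named_steps['logreg'].coef_.shape)  # one coefficient row per class
```

This is handy for verifying that preprocessing statistics were computed from the training data alone.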
Common Pitfalls
- Not including preprocessing steps inside the pipeline can cause data leakage if you preprocess train and test data separately.
- Forgetting to set random_state in the train-test split or the model can cause inconsistent results.
- Using incompatible transformers, or omitting the final estimator step, causes errors.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Error-prone: managing the scaler by hand makes leakage easy,
# e.g. accidentally calling fit_transform on test data or on the full dataset
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Right: the pipeline fits the scaler only on the data passed to fit()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
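The leakage protection matters most during cross-validation: passing the pipeline to cross_val_score re-fits the scaler on each training fold, which manual scaling outside the loop cannot do. A minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200))
])

# The scaler is re-fitted on each training fold, so no fold's held-out
# portion ever influences the scaling statistics
scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print(f'Mean CV accuracy: {scores.mean():.2f}')
```

Scaling the full dataset once before cross-validation would leak each fold's held-out statistics into training; the pipeline makes that mistake impossible.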
Quick Reference
Remember these key steps for an sklearn ML pipeline:
- Define steps: List of (name, transformer/estimator) tuples.
- Fit pipeline: Use pipeline.fit(X_train, y_train).
- Predict: Use pipeline.predict(X_test).
- Evaluate: Use metrics like accuracy on predictions.
- Prevent leakage: Include all preprocessing inside the pipeline.
Key Takeaways
- Use sklearn's Pipeline to combine preprocessing and model steps into one object.
- Always include data transformations inside the pipeline to avoid data leakage.
- Fit the pipeline on training data and use it to predict on new data.
- Set random_state for reproducible train-test splits and model results.
- Evaluate model performance using appropriate metrics after prediction.
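As a closing aside, the step names you choose also serve as parameter prefixes for hyperparameter search: GridSearchCV addresses a step's parameter as '<step name>__<parameter>'. A sketch, assuming the same 'logreg' step name used earlier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# 'logreg__C' targets the C parameter of the step named 'logreg';
# scaling is re-fitted inside every CV fold of the search
grid = GridSearchCV(pipeline, {'logreg__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(iris.data, iris.target)
print(grid.best_params_)
```

Searching over the whole pipeline keeps preprocessing inside each cross-validation fold, so the tuning itself stays leak-free.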