Create End-to-End ML Pipeline in Python with sklearn
To create an end-to-end ML pipeline in Python, use sklearn.pipeline.Pipeline to chain preprocessing steps and a model into a single object. The whole chain can then be trained and applied in a clean, repeatable way through one fit() and predict() interface.
Syntax
An ML pipeline in sklearn is created with Pipeline, which takes a list of steps. Each step is a tuple of a name and a transformer or estimator. The last step must be an estimator (the model). Use fit() to train and predict() to get predictions.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),    # Data scaling step
    ('model', LogisticRegression())  # Model training step
])
```
Example
This example shows a full pipeline that scales data, trains a logistic regression model on the iris dataset, and evaluates accuracy.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```
Output
Accuracy: 1.00
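Because the pipeline is one object, its fitted steps remain inspectable after training. A minimal sketch (reusing the same iris setup as above) that reads the learned scaler statistics and model coefficients via named_steps:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])
pipeline.fit(X_train, y_train)

# Each fitted step is reachable by the name given in the steps list
print(pipeline.named_steps['scaler'].mean_)        # per-feature means, learned from X_train only
print(pipeline.named_steps['logreg'].coef_.shape)  # one coefficient row per class
```

This is handy for verifying that preprocessing statistics were computed from the training data alone.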
Common Pitfalls
- Not including preprocessing steps inside the pipeline can cause data leakage if you preprocess train and test data separately.
- Forgetting to set random_state in the train-test split or the model can cause inconsistent results.
- Using incompatible transformers, or omitting the final estimator step, causes errors.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Error-prone: managing the scaler by hand makes leakage easy,
# e.g. accidentally calling fit_transform on test data or on the full dataset
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Right: the pipeline fits the scaler only on the data passed to fit()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
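The leakage protection matters most during cross-validation: passing the pipeline to cross_val_score re-fits the scaler on each training fold, which manual scaling outside the loop cannot do. A minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200))
])

# The scaler is re-fitted on each training fold, so no fold's held-out
# portion ever influences the scaling statistics
scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print(f'Mean CV accuracy: {scores.mean():.2f}')
```

Scaling the full dataset once before cross-validation would leak each fold's held-out statistics into training; the pipeline makes that mistake impossible.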
Quick Reference
Remember these key steps for an sklearn ML pipeline:
- Define steps: List of (name, transformer/estimator) tuples.
- Fit pipeline: Use pipeline.fit(X_train, y_train).
- Predict: Use pipeline.predict(X_test).
- Evaluate: Use metrics like accuracy on predictions.
- Prevent leakage: Include all preprocessing inside the pipeline.
Key Takeaways
- Use sklearn's Pipeline to combine preprocessing and model steps into one object.
- Always include data transformations inside the pipeline to avoid data leakage.
- Fit the pipeline on training data and use it to predict on new data.
- Set random_state for reproducible train-test splits and model results.
- Evaluate model performance using appropriate metrics after prediction.
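As a closing aside, the step names you choose also serve as parameter prefixes for hyperparameter search: GridSearchCV addresses a step's parameter as '<step name>__<parameter>'. A sketch, assuming the same 'logreg' step name used earlier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# 'logreg__C' targets the C parameter of the step named 'logreg';
# scaling is re-fitted inside every CV fold of the search
grid = GridSearchCV(pipeline, {'logreg__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(iris.data, iris.target)
print(grid.best_params_)
```

Searching over the whole pipeline keeps preprocessing inside each cross-validation fold, so the tuning itself stays leak-free.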