How to Create an ML Pipeline: Step-by-Step Guide
To create an ML pipeline, you connect steps such as data preprocessing, model training, and evaluation into a sequence that automates the workflow. Tools such as scikit-learn's Pipeline class let you organize these steps cleanly and run them together.
Syntax
An ML pipeline typically chains multiple steps: data cleaning, feature extraction, model training, and prediction. In Python's scikit-learn, you use the Pipeline class to define this sequence.
Each step has a name and a transformer or estimator object. The last step is usually the model.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
```
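After fitting, each named step stays accessible through the pipeline's `named_steps` attribute, which is useful for inspecting learned parameters. A minimal sketch (fitting on the full iris dataset purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200))
])
pipeline.fit(X, y)

# Each fitted step is reachable by the name given in the step list
print(pipeline.named_steps['scaler'].mean_.round(2))   # per-feature means learned by the scaler
print(pipeline.named_steps['model'].coef_.shape)       # (3, 4): 3 classes, 4 features
```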
Example
This example shows a full ML pipeline that scales data, trains a logistic regression model, and evaluates accuracy on test data.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```
Output
Accuracy: 1.00
Common Pitfalls
- Not including preprocessing steps in the pipeline, which can cause data leakage if preprocessing is done before splitting data.
- Forgetting to fit the pipeline before predicting.
- Using incompatible transformers or models that do not follow the expected interface.
- Not setting random states for reproducibility.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Wrong: fitting the scaler separately before building the pipeline
scaler = StandardScaler()
scaler.fit(X_train)  # data leakage risk if done before the train/test split
pipeline = Pipeline([
    ('scaler', scaler),
    ('model', LogisticRegression(random_state=42))
])

# Right: include the scaler inside the pipeline and fit once
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
```
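Keeping preprocessing inside the pipeline pays off most during cross-validation: the whole pipeline is refit on each training fold, so the scaler never sees the held-out fold. A hedged sketch of this pattern with `cross_val_score`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42, max_iter=200))
])

# cross_val_score clones and refits the entire pipeline on each training fold,
# so the scaler's statistics are never computed from validation data
scores = cross_val_score(pipeline, X, y, cv=5)
print(f'Mean CV accuracy: {scores.mean():.2f}')
```

If you had fit the scaler on the full dataset first, every fold's validation data would leak into the scaling statistics and the scores could be optimistically biased.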
Quick Reference
Remember these key points when creating ML pipelines:
- Use Pipeline to chain preprocessing and modeling steps.
- Fit the pipeline on training data only.
- Use pipeline.predict() to get predictions.
- Set random states for reproducibility.
- Keep pipeline steps simple and modular.
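Step names also matter for hyperparameter tuning: scikit-learn addresses pipeline parameters with the `<step name>__<parameter>` convention, so a pipeline plugs straight into GridSearchCV. A minimal sketch, reusing the step names from the examples above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42, max_iter=200))
])

# 'model__C' targets the C parameter of the step named 'model'
param_grid = {'model__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```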
Key Takeaways
- Use a pipeline to automate and organize ML workflow steps cleanly.
- Include all preprocessing inside the pipeline to avoid data leakage.
- Fit the pipeline only on training data before predicting.
- Use pipeline methods like fit and predict for consistent results.
- Set random states to make your experiments reproducible.
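When you don't need custom step names, scikit-learn's make_pipeline helper builds the same object with auto-generated names (the lowercased class names). A short sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline derives step names from the estimator class names
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
print([name for name, _ in pipeline.steps])
# → ['standardscaler', 'logisticregression']
```

The generated names are what you would use in grid-search parameter keys, e.g. 'logisticregression__C'.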