How to Use Pipeline in sklearn for Clean ML Workflows
Use Pipeline in sklearn to chain multiple steps like data preprocessing and model training into one object. This lets you run all steps together with the fit and predict methods, making your code cleaner and less error-prone.
Syntax
The Pipeline is created by passing a list of named steps, where each step is a tuple with a name and a transformer or estimator. The last step is usually a model. You call fit to train all steps and predict to get predictions.
- steps: List of tuples like ('name', transformer/estimator)
- fit(): Trains all steps in order
- predict(): Runs data through all steps and outputs predictions
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Fit the pipeline on training data
# (X_train, y_train, and X_test are assumed to be defined already)
pipeline.fit(X_train, y_train)

# Predict on new data
predictions = pipeline.predict(X_test)
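To make the descriptions of fit and predict more concrete, the sketch below compares a pipeline with the equivalent manual steps. The iris dataset, the split settings, and the variable names (pipeline_pred, manual_pred, same_predictions) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: iris is an assumption, any numeric dataset would do
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pipeline version: one fit call, one predict call
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# Manual version: the same work written out step by step
scaler = StandardScaler()
model = LogisticRegression(max_iter=200)
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
model.fit(X_train_scaled, y_train)              # fit the model on the scaled data
manual_pred = model.predict(scaler.transform(X_test))  # reuse the fitted scaler

# Both approaches should produce the same predictions
same_predictions = np.array_equal(pipeline_pred, manual_pred)
print(same_predictions)
```

The manual version needs four calls that must happen in the right order, while the pipeline needs two. That difference is where most of the error reduction comes from.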
Example
This example shows how to use Pipeline to scale features and train a logistic regression model on the iris dataset. It fits the pipeline and prints the accuracy on test data.
python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Output
Accuracy: 1.00
Common Pitfalls
Common mistakes when using Pipeline include:
- Not naming steps uniquely, which causes errors.
- Trying to use transform on a pipeline that ends with a model that does not support it.
- Forgetting to call fit before predict.
- Passing raw data that still needs preprocessing that is not part of the pipeline.
If you plan to call predict, make sure the last step is an estimator with fit and predict methods.
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Wrong: duplicate step names
# pipeline = Pipeline([
#     ('scaler', StandardScaler()),
#     ('scaler', StandardScaler()),  # Error: duplicate name
#     ('model', LogisticRegression())
# ])

# Right: unique step names
pipeline = Pipeline([
    ('scaler1', StandardScaler()),
    ('scaler2', StandardScaler()),
    ('model', LogisticRegression())
])
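The transform and fit-before-predict pitfalls can be sketched in the same way. This is a minimal illustration rather than a complete guide. The iris data and the variable names (has_transform, raised_not_fitted, scaled) are assumptions, and slicing the pipeline with pipeline[:-1] is one possible way to get the preprocessed features, not the only one.

```python
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (an assumption, not part of the original example)
X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200)),
])

# LogisticRegression has no transform method,
# so the pipeline does not expose transform either
has_transform = hasattr(pipeline, 'transform')
print(has_transform)

# Calling predict before fit fails because no step has been fitted yet
try:
    pipeline.predict(X)
    raised_not_fitted = False
except NotFittedError:
    raised_not_fitted = True
print(raised_not_fitted)

# After fitting, predict works, and slicing off the final step
# gives a sub-pipeline that does support transform
pipeline.fit(X, y)
scaled = pipeline[:-1].transform(X)
print(scaled.shape)
```

Checking hasattr before calling transform is a cheap guard when the final step of a pipeline is not known in advance.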
Quick Reference
Remember these tips when using Pipeline:
- Each step is a tuple: ('name', transformer/estimator).
- To call predict, the last step must be an estimator with fit and predict.
- Use fit once to train all steps.
- Use predict to get predictions after fitting.
- Pipeline helps avoid data leakage by applying preprocessing inside the pipeline.
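The data leakage point can be checked directly. The sketch below, which assumes the iris dataset and uses illustrative variable names, confirms that the scaler inside a fitted pipeline learned its statistics from the training split alone. It then passes the whole pipeline to cross_val_score so that preprocessing is refit on the training portion of each fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (assumptions for this sketch)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)

# The fitted scaler's mean should match the training data, not the full dataset
scaler_mean = pipeline.named_steps['scaler'].mean_
uses_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))
uses_full_data = np.allclose(scaler_mean, X.mean(axis=0))
print(uses_train_only, uses_full_data)

# Cross-validating the whole pipeline refits the scaler inside every fold,
# so held-out samples never influence the scaling parameters
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Scaling X before splitting or cross-validating would let test samples influence the scaler, which is the leakage the pipeline is meant to prevent.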
Key Takeaways
Use sklearn's Pipeline to chain preprocessing and modeling steps into one object.
Name each step uniquely and ensure the last step is a model with fit and predict.
Call fit on the pipeline to train all steps together, then use predict for results.
Pipeline helps prevent data leakage by fitting transformers only on the training data during fit, then reusing those fitted parameters when predict is called on new data.
Avoid calling transform on pipelines ending with models that do not support it.