How to Create a Pipeline in sklearn with Python - Simple Guide
To create a pipeline in sklearn, use the Pipeline class by passing a list of steps, where each step is a tuple of a name and a transformer or estimator. This lets you chain preprocessing and modeling steps into one object for easy training and prediction.
Syntax
The Pipeline class in sklearn.pipeline takes a list of steps, where each step is a tuple with a name and a transformer or estimator. The last step is usually an estimator (model), and earlier steps are transformers (like scalers or feature selectors).
Example syntax:
Pipeline([('step_name1', transformer1), ('step_name2', estimator)])
Each step name is a string identifier, and the transformer or estimator is an object implementing fit and transform (for transformers) or fit and predict (for estimators).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
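Step names are more than labels: they can be used to look up a step and to set its parameters. The following is a minimal sketch, assuming the same 'scaler' and 'logreg' step names as above; the value 0.5 for C is an arbitrary choice for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression()),
])

# Look up a step object by the name it was given
scaler = pipeline.named_steps['scaler']

# Set a parameter on a step using the '<step name>__<parameter>' form
# (0.5 is an arbitrary illustrative value)
pipeline.set_params(logreg__C=0.5)
logreg_C = pipeline.named_steps['logreg'].C

print(type(scaler).__name__, logreg_C)
```

The same '<step name>__<parameter>' form is what tools such as GridSearchCV expect when tuning a pipeline, which is one reason to pick short, descriptive step names.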
Example
This example shows how to create a pipeline that scales features and then fits a logistic regression model on the Iris dataset. It trains the pipeline and prints the accuracy on test data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict and evaluate
predictions = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')
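To see what the pipeline does on your behalf, it can help to write the same steps out by hand. The sketch below assumes the same Iris split and settings as the example above; the manual_* variable names are only for illustration. It fits the scaler on the training data, reuses it on the test data, and checks that the predictions match the pipeline's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Manual version: fit the scaler on training data only, then reuse it on test data
manual_scaler = StandardScaler()
X_train_scaled = manual_scaler.fit_transform(X_train)
X_test_scaled = manual_scaler.transform(X_test)
manual_model = LogisticRegression(max_iter=200)
manual_model.fit(X_train_scaled, y_train)
manual_predictions = manual_model.predict(X_test_scaled)

# Pipeline version: the same two steps wrapped in one object
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)
pipeline_predictions = pipeline.predict(X_test)

# Compare the two sets of predictions
same = np.array_equal(manual_predictions, pipeline_predictions)
print('Predictions match:', same)
```

The pipeline removes the bookkeeping of calling fit_transform on training data and transform on test data, which is where manual code tends to leak test information into preprocessing.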
Common Pitfalls
1. Forgetting to name steps uniquely: Each step name must be a unique string, or sklearn will raise an error.
2. Putting an estimator before a transformer: Every step except the last must be a transformer (an object implementing fit and transform), so a model such as LogisticRegression belongs at the end.
3. Not using the pipeline for both training and prediction: Always call fit, predict, or transform on the pipeline object itself so that every step runs, in order, on both training and new data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Wrong: Estimator before transformer
wrong_pipeline = Pipeline([
    ('logreg', LogisticRegression()),
    ('scaler', StandardScaler())
])
# This will cause errors: LogisticRegression has no transform method,
# so it cannot be used as an intermediate step

# Right: Transformer before estimator
right_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
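The first two pitfalls can be observed directly. The sketch below uses a hypothetical helper, try_fit, that builds and fits a pipeline on Iris and reports the exception type, if any. In recent scikit-learn releases the checks typically run when fit is called, while older releases may run them at construction; both cases are caught here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)


def try_fit(steps):
    """Hypothetical helper: build and fit a pipeline, return the error type name or None."""
    try:
        Pipeline(steps).fit(X, y)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# Pitfall 1: two steps share the same name
duplicate_names = try_fit([
    ('step', StandardScaler()),
    ('step', LogisticRegression(max_iter=200)),
])

# Pitfall 2: a model placed before a transformer
wrong_order = try_fit([
    ('logreg', LogisticRegression(max_iter=200)),
    ('scaler', StandardScaler()),
])

# Correct: transformer first, estimator last
correct = try_fit([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

print(duplicate_names, wrong_order, correct)
```

Both broken pipelines should report an exception type, while the correctly ordered one reports None.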
Quick Reference
Pipeline Cheat Sheet:
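- Pipeline(steps=[('name', transformer_or_estimator), ...]): Create pipeline
- pipeline.fit(X, y): Train all steps
- pipeline.predict(X): Predict using last estimator
- pipeline.transform(X): Transform data through every step; available only when the last step is a transformer (use pipeline[:-1].transform(X) to apply all but the last step)
- Use unique step names
- Last step is usually an estimator; all earlier steps must be transformers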
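One detail of the cheat sheet is worth a short sketch: when the last step is a classifier, the pipeline does not expose transform. Slicing the pipeline (supported in reasonably recent scikit-learn versions) gives access to the preprocessing steps alone. The example assumes the same 'scaler' and 'logreg' steps used throughout this guide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])
pipeline.fit(X, y)

# With a classifier as the last step, the pipeline has predict but not transform
has_predict = hasattr(pipeline, 'predict')
has_transform = hasattr(pipeline, 'transform')

# Slice off the last step to apply only the fitted preprocessing steps
scaled = pipeline[:-1].transform(X)

print(has_predict, has_transform, scaled.shape)
```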