How to Use the Pipeline Class in sklearn with Python
Use the Pipeline class from sklearn.pipeline to chain multiple data processing steps and a model into one object. Define steps as a list of tuples with a name and transformer or estimator, then fit and predict using the pipeline like a single model.
Syntax
The Pipeline class is created by passing a list of steps, where each step is a tuple containing a name (string) and a transformer or estimator object. The last step is usually a model, and earlier steps are data transformers.
Example syntax:
Pipeline(steps=[('name1', transformer1), ('name2', transformer2), ('model', estimator)])
Use fit() to train and predict() to make predictions on new data.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
```
Example
This example shows how to create a pipeline that scales features and then fits a logistic regression model on the Iris dataset. It demonstrates training and prediction with the pipeline.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Create pipeline
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
Output
Accuracy: 1.00
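To make the chaining concrete, the sketch below repeats the same steps by hand and compares the predictions with those of the pipeline. The names `manual_scaler`, `manual_model`, `manual_pred`, and `same` are illustrative and not part of the original example. The point is only that the scaler is fitted on the training data and then reused, unchanged, on the test data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# Pipeline version: one fit() call and one predict() call
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# Manual version (illustrative names): fit the scaler on the training data,
# train the model on the scaled data, then reuse the fitted scaler on the test data
manual_scaler = StandardScaler()
X_train_scaled = manual_scaler.fit_transform(X_train)
manual_model = LogisticRegression(max_iter=200)
manual_model.fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(manual_scaler.transform(X_test))

# Compare the two sets of predictions
same = np.array_equal(pipeline_pred, manual_pred)
print(f"Same predictions: {same}")
```

The manual version needs the scaler to be applied consistently in two places, which is the bookkeeping the pipeline takes over.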
Common Pitfalls
- Not naming steps uniquely: Each step name must be a unique string.
- Placing the model before transformers: The model should be the last step, because every earlier step must be a transformer.
- Calling fit_transform() on the whole pipeline when the last step is an estimator without a transform method.
- Forgetting to call fit() before predict().
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Wrong: model before scaler.
# Intermediate steps must implement fit and transform,
# so fitting this pipeline is expected to raise a TypeError.
wrong_pipeline = Pipeline(steps=[
    ('logreg', LogisticRegression()),
    ('scaler', StandardScaler())
])

# Right order: transformers first, model last
right_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
```
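The remaining pitfalls can be reproduced in a similar way. The sketch below is an illustration rather than a reference: the `errors` dictionary and the `make_steps` helper are hypothetical names introduced here, and the exact error messages (and whether a check runs at construction or at fit time) can differ between scikit-learn versions, so only the exception type is recorded.

```python
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)
errors = {}  # hypothetical helper: pitfall name -> exception type name


def make_steps():
    """Hypothetical helper returning a fresh, correctly ordered step list."""
    return [('scaler', StandardScaler()),
            ('logreg', LogisticRegression(max_iter=200))]


# Pitfall: two steps share the name 'scaler'
try:
    Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('scaler', MinMaxScaler()),
        ('logreg', LogisticRegression(max_iter=200)),
    ]).fit(X, y)
except ValueError as exc:
    errors['duplicate_names'] = type(exc).__name__

# Pitfall: model placed before a transformer
try:
    Pipeline(steps=[
        ('logreg', LogisticRegression(max_iter=200)),
        ('scaler', StandardScaler()),
    ]).fit(X, y)
except TypeError as exc:
    errors['model_not_last'] = type(exc).__name__

# Pitfall: fit_transform() when the last step has no transform method
try:
    Pipeline(steps=make_steps()).fit_transform(X, y)
except AttributeError as exc:
    errors['fit_transform_on_model'] = type(exc).__name__

# Pitfall: predict() before fit()
try:
    Pipeline(steps=make_steps()).predict(X)
except NotFittedError as exc:
    errors['predict_before_fit'] = type(exc).__name__

for pitfall, exc_name in errors.items():
    print(f"{pitfall}: {exc_name}")
```

Catching the specific exception types keeps the sketch honest: if a future release changes one of these behaviours, the corresponding entry simply will not appear in `errors`.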
Quick Reference
| Method or attribute | Description |
|---|---|
| fit(X, y) | Train all steps in the pipeline on data X and labels y |
| predict(X) | Make predictions using the final estimator on data X |
| fit_transform(X, y=None) | Fit all steps and return the transformed data; works only when the final step is itself a transformer |
| named_steps | Access steps by name, e.g., pipeline.named_steps['scaler'] |
| set_params(**params) | Set parameters of steps using step__param syntax |
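A short sketch of the last rows of the table, using the same scaler and logistic regression steps as the earlier example. The value `C=0.5` is an arbitrary choice for illustration, and `scaling_only` is a hypothetical second pipeline added here to show a case where fit_transform() is available.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

# step__param syntax: step name, two underscores, parameter name
pipeline.set_params(logreg__C=0.5)  # 0.5 is an arbitrary illustrative value
pipeline.fit(X, y)

# named_steps gives access to the fitted step objects by name
scaler = pipeline.named_steps['scaler']
print("Per-feature means learned by the scaler:", scaler.mean_)
print("C used by the model:", pipeline.named_steps['logreg'].C)

# When the last step is a transformer, fit_transform() returns the transformed data
scaling_only = Pipeline(steps=[('scaler', StandardScaler())])
X_scaled = scaling_only.fit_transform(X)
print("Shape of the scaled data:", X_scaled.shape)
```

Because set_params() reaches into a named step, the same `step__param` keys can be reused wherever scikit-learn accepts estimator parameters by name.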
Key Takeaways
Use sklearn's Pipeline to chain preprocessing and modeling steps into one object.
Always put transformers before the final estimator in the steps list.
Call fit() on the pipeline to train all steps together.
Use unique names for each step to avoid errors.
Access individual steps via pipeline.named_steps for inspection or tuning.