How to Use Pipeline with Grid Search in Python | sklearn Guide
Use Pipeline to chain preprocessing and model steps, then pass it to GridSearchCV with a parameter grid whose keys use the step names. This lets you tune the parameters of every pipeline step together in a clean, reusable way.
Syntax
The basic syntax involves creating a Pipeline with named steps, then using GridSearchCV with a parameter grid that references these steps by their names.
Key parts:
- `Pipeline([('step_name', transformer_or_estimator), ...])`: Chains steps.
- `GridSearchCV(estimator=pipeline, param_grid=param_grid)`: Searches for the best parameters.
- Parameter names in `param_grid` use the format `step_name__parameter_name`.
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__penalty': ['l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
```
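Because keys follow the step_name__parameter_name format, one grid can cover a preprocessing step and the model at the same time. The sketch below is a minimal illustration under stated assumptions: the iris dataset, the choice of `with_mean` as the scaler setting to vary, and `max_iter=1000` are picked here for the example and do not come from the syntax above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (an assumption for this sketch)
X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# One grid, two steps: each key is step_name__parameter_name
param_grid = {
    'scaler__with_mean': [True, False],  # parameter of the 'scaler' step
    'clf__C': [0.1, 1, 10],              # parameter of the 'clf' step
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X, y)

# 2 scaler settings x 3 values of C = 6 candidate combinations
print(len(grid_search.cv_results_['params']))
print(grid_search.best_params_)
```

Every combination is cross-validated as a whole pipeline, so the scaler is refit on the training part of each fold.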
Example
This example shows how to create a pipeline with a scaler and logistic regression, then use grid search to find the best regularization parameter.
```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load data
X, y = load_iris(return_X_y=True)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200))
])

# Parameter grid
# Note: liblinear handles multiclass data one-vs-rest; newer scikit-learn
# releases may warn about using it on multiclass problems such as iris.
param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__solver': ['liblinear']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)

# Output best params and score
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)
```
Output
Best parameters: {'clf__C': 1, 'clf__solver': 'liblinear'}
Best cross-validation accuracy: 0.98
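Once the search has been fitted, the search object can be used like a fitted model. The following sketch shows one way to do that, assuming a held-out split made with `train_test_split` (the split, `random_state=0`, and `max_iter=1000` are choices made for this illustration, not part of the example above). With the default `refit=True`, `best_estimator_` is the whole pipeline refit on the training data using the best parameters.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set (an assumption for this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

grid_search = GridSearchCV(pipeline, {'clf__C': [0.1, 1, 10]}, cv=3)
grid_search.fit(X_train, y_train)

# The refit pipeline (scaler + classifier) with the best parameters
best_pipeline = grid_search.best_estimator_

# score() and predict() on the search object delegate to best_pipeline
test_accuracy = grid_search.score(X_test, y_test)
print('Held-out accuracy:', test_accuracy)

# Mean cross-validation score for each candidate
results = grid_search.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params, round(float(mean_score), 3))
```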
Common Pitfalls
Common mistakes include:
- Not using double underscores `__` to separate the step name and the parameter name in `param_grid`.
- Passing a raw estimator to `GridSearchCV` instead of a pipeline when preprocessing is needed, so the preprocessing is not refit inside each cross-validation fold.
- Forgetting to set `max_iter` for some models, such as logistic regression, which causes convergence warnings.
```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('clf', LogisticRegression())])

# Wrong: the param_grid key is missing the step name prefix 'clf__'
param_grid_wrong = {'C': [0.1, 1, 10]}
# The constructor accepts this grid, but calling .fit() raises an error
# because 'C' is not a parameter of the Pipeline itself.
# grid_search = GridSearchCV(pipeline, param_grid_wrong)

# Right: prefix the parameter with the step name
param_grid_right = {'clf__C': [0.1, 1, 10]}
grid_search = GridSearchCV(pipeline, param_grid_right)
```
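A quick way to check key names before running a search is to list the names the pipeline accepts with `get_params()`. The sketch below also shows the misnamed key being rejected at fit time. The iris data, `max_iter=1000`, and the variable names are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([('clf', LogisticRegression(max_iter=1000))])

# Every key accepted in param_grid appears in get_params()
valid_names = sorted(pipeline.get_params().keys())
print('clf__C' in valid_names)  # True
print('C' in valid_names)       # False

# A misnamed key is only rejected once fit() tries to set it
bad_search = GridSearchCV(pipeline, {'C': [0.1, 1, 10]}, cv=3)
try:
    bad_search.fit(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
    print('Rejected:', error_message.splitlines()[0])
```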
Quick Reference
| Concept | Description | Example |
|---|---|---|
| Pipeline | Chains preprocessing and model steps | Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]) |
| GridSearchCV | Searches best parameters with cross-validation | GridSearchCV(pipeline, param_grid, cv=5) |
| Parameter Grid | Dictionary with step__param keys and list of values | {'clf__C': [0.1, 1, 10]} |
| Step Naming | Use step names to reference parameters | 'clf__C' for LogisticRegression's C parameter |
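Step names matter even when they are not written out by hand. As a small sketch, scikit-learn's `make_pipeline` helper (not used elsewhere in this guide) generates step names from the lowercased class names, and those generated names are what the `param_grid` keys must use. The variable names here are chosen for this illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names each step after its lowercased class name
auto_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

step_names = [name for name, _ in auto_pipeline.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# The grid keys use the generated step names
param_grid = {'logisticregression__C': [0.1, 1, 10]}
```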
Key Takeaways
Use sklearn Pipeline to combine preprocessing and model steps for clean workflows.
Pass the pipeline to GridSearchCV with a param_grid using step names and double underscores.
Give pipeline steps clear names, because param_grid keys are built from them.
Set necessary model parameters like max_iter to avoid warnings during training.
Check parameter names carefully to avoid errors in GridSearchCV.