
How to Use Pipeline with Grid Search in Python | sklearn Guide

Use Pipeline to chain preprocessing and model steps, then pass it to GridSearchCV with a parameter grid using step names. This lets you tune parameters of all pipeline steps together in a clean, reusable way.

Syntax

The basic syntax involves creating a Pipeline with named steps, then using GridSearchCV with a parameter grid that references these steps by their names.

Key parts:

  • Pipeline([('step_name', transformer_or_estimator), ...]): Chains steps.
  • GridSearchCV(estimator=pipeline, param_grid=param_grid): Searches for the best parameter combination using cross-validation.
  • Parameter names in param_grid use the format step_name__parameter_name.
python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__penalty': ['l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
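To check which keys are valid in param_grid, one option is to list the names the pipeline exposes through get_params(). The following is a minimal sketch that reuses the step names scaler and clf from the syntax above. The variable names valid_keys and clf_keys are only illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# get_params() lists every name that set_params() accepts,
# including nested names in the step_name__parameter_name format.
valid_keys = sorted(pipeline.get_params().keys())

# Keep only the nested parameter names of the 'clf' step.
clf_keys = [key for key in valid_keys if key.startswith('clf__')]
print(clf_keys)
```

Any name in this list, such as clf__C or scaler__with_mean, can be used as a key in param_grid.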

Example

This example shows how to create a pipeline with a scaler and logistic regression, then use grid search to find the best regularization parameter.

python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load data
X, y = load_iris(return_X_y=True)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200))
])

# Parameter grid
param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__solver': ['liblinear']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)

# Output best params and score
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)
Output
Best parameters: {'clf__C': 1, 'clf__solver': 'liblinear'}
Best cross-validation accuracy: 0.98
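The example above tunes only the classifier step. As a rough sketch of tuning more than one step at once, the grid below also varies the scaler's with_mean option. The choice of with_mean and the larger max_iter value are assumptions made for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Keys for two different steps in one grid: 2 x 3 = 6 combinations.
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)

# cv_results_['params'] holds one entry per parameter combination tried.
n_candidates = len(grid_search.cv_results_['params'])
print('Candidates evaluated:', n_candidates)

# best_estimator_ is the whole pipeline, refit on all the data
# with the best parameters (refit=True is the default).
predictions = grid_search.best_estimator_.predict(X[:3])
print('Predictions for first rows:', predictions)
```

Because best_estimator_ is the full pipeline, new data passed to predict goes through the same scaling step before it reaches the classifier.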

Common Pitfalls

Common mistakes include:

  • Not using double underscores __ to separate step name and parameter name in param_grid.
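  • Passing a raw estimator to GridSearchCV and preprocessing the data separately, instead of putting the preprocessing in a pipeline. Scaling the full dataset before the search lets information from the validation folds leak into training.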
  • Forgetting to set max_iter for some models like logistic regression, causing convergence warnings.
python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Wrong: param_grid keys missing step name
pipeline = Pipeline([('clf', LogisticRegression())])
param_grid_wrong = {'C': [0.1, 1, 10]}  # Missing 'clf__'

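# Constructing the search works, but calling fit() raises a ValueError
# because 'C' is not a parameter of the Pipeline itself
# grid_search = GridSearchCV(pipeline, param_grid_wrong)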

# Right way
param_grid_right = {'clf__C': [0.1, 1, 10]}
grid_search = GridSearchCV(pipeline, param_grid_right)
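To see how the missing step prefix shows up in practice, the sketch below fits a search with the wrong key and catches the resulting error. It assumes the iris dataset from the earlier example. The names bad_search and error_message are illustrative, and the exact wording of the message may differ between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([('clf', LogisticRegression(max_iter=1000))])

# Constructing the search succeeds even though the key lacks 'clf__'.
bad_search = GridSearchCV(pipeline, {'C': [0.1, 1, 10]}, cv=3)

# The invalid parameter name is only reported once fit() runs.
try:
    bad_search.fit(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The printed message names the invalid parameter, which is usually enough to spot a missing step prefix.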

Quick Reference

| Concept | Description | Example |
| --- | --- | --- |
| Pipeline | Chains preprocessing and model steps | Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]) |
| GridSearchCV | Searches best parameters with cross-validation | GridSearchCV(pipeline, param_grid, cv=5) |
| Parameter Grid | Dictionary with step__param keys and list of values | {'clf__C': [0.1, 1, 10]} |
| Step Naming | Use step names to reference parameters | 'clf__C' for LogisticRegression's C parameter |

Key Takeaways

  • Use sklearn Pipeline to combine preprocessing and model steps for clean workflows.
  • Pass the pipeline to GridSearchCV with a param_grid using step names and double underscores.
  • Always name pipeline steps to reference their parameters in grid search.
  • Set necessary model parameters like max_iter to avoid warnings during training.
  • Check parameter names carefully to avoid errors in GridSearchCV.