
How to Use Pipeline with Grid Search in Python | sklearn Guide

Use Pipeline to chain preprocessing and model steps, then pass it to GridSearchCV with a parameter grid whose keys are built from the step names. This tunes the parameters of every pipeline step in a single search and refits the preprocessing on each cross-validation training fold, in a clean, reusable way.

📐

Syntax

Create a Pipeline with named steps, then pass it to GridSearchCV together with a parameter grid that refers to those steps by name.

Key parts:

Pipeline([('step_name', transformer_or_estimator), ...]): Chains steps.
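GridSearchCV(estimator=pipeline, param_grid=param_grid): Searches for the best parameter combination using cross-validation.
Parameter names in param_grid use the format step_name__parameter_name (two underscores between the step name and the parameter name).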

python

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__penalty': ['l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
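If you are unsure which keys are valid, the pipeline can list them for you. The snippet below is a minimal sketch that uses get_params() to print the available names and to check a grid before running the search; the variable names valid_keys and unknown are only illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Every settable name, including the step_name__parameter_name forms
valid_keys = sorted(pipeline.get_params().keys())
print([k for k in valid_keys if k.startswith('clf__')])

# Quick sanity check of a grid before running the search
param_grid = {'clf__C': [0.1, 1, 10]}
unknown = [k for k in param_grid if k not in valid_keys]
print('Unknown keys:', unknown)
```

An empty list of unknown keys means every key in the grid matches a parameter of the pipeline.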

💻

Example

This example shows how to create a pipeline with a scaler and logistic regression, then use grid search to find the best regularization parameter.

python

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load data
X, y = load_iris(return_X_y=True)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200))
])

# Parameter grid
param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__solver': ['liblinear']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)

# Output best params and score
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)

Output

Best parameters: {'clf__C': 1, 'clf__solver': 'liblinear'}
Best cross-validation accuracy: 0.98
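After fitting, the search object can be used like a trained model. The sketch below assumes the default refit=True and, to stay short, searches only over clf__C with the default solver, so its scores are not meant to match the output above.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200))
])
grid_search = GridSearchCV(pipeline, {'clf__C': [0.1, 1, 10]}, cv=3)
grid_search.fit(X, y)

# With refit=True (the default), best_estimator_ is the whole pipeline,
# refitted on all of X and y with the best parameters.
best_pipeline = grid_search.best_estimator_
print(type(best_pipeline).__name__)

# predict() delegates to that pipeline, so the input is scaled first.
predictions = grid_search.predict(X[:5])
print(predictions.shape)

# Mean cross-validated score for each candidate in the grid
for params, score in zip(grid_search.cv_results_['params'],
                         grid_search.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

Looking at cv_results_ for every candidate, not only best_params_, helps you see whether the best value sits at the edge of the grid and the range should be widened.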

⚠️

Common Pitfalls

Common mistakes include:

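Using anything other than a double underscore (__) to separate the step name and the parameter name in param_grid.
Passing a raw estimator to GridSearchCV and doing the preprocessing outside of it. A scaler fitted on the full dataset has already seen the validation folds, which leaks information into the cross-validation score; inside a pipeline it is refitted on each training fold.
Leaving max_iter at its default for iterative models such as logistic regression, which can cause convergence warnings.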

python

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Wrong: param_grid keys missing step name
pipeline = Pipeline([('clf', LogisticRegression())])
param_grid_wrong = {'C': [0.1, 1, 10]}  # Missing 'clf__'

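# GridSearchCV accepts this grid, but calling .fit() raises a ValueError
# because 'C' is not a parameter of the Pipeline itself
# grid_search = GridSearchCV(pipeline, param_grid_wrong)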

# Right way
param_grid_right = {'clf__C': [0.1, 1, 10]}
grid_search = GridSearchCV(pipeline, param_grid_right)
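To see when the wrong key actually fails, here is a small sketch on the iris data. The names bad_search and failed are only illustrative, and the exact wording of the error message depends on your scikit-learn version, so only its start is printed.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([('clf', LogisticRegression(max_iter=200))])

# Building the search object works even with a bad key...
bad_search = GridSearchCV(pipeline, {'C': [0.1, 1, 10]}, cv=3)

# ...the problem only shows up when fit() tries to set the parameter.
try:
    bad_search.fit(X, y)
    failed = False
except ValueError as exc:
    failed = True
    print('ValueError:', str(exc)[:60])
```

Because the error appears only at fit time, a typo in a key can go unnoticed until the search starts.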

📊

Quick Reference

Concept	Description	Example
Pipeline	Chains preprocessing and model steps	Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
GridSearchCV	Searches for the best parameters with cross-validation	GridSearchCV(pipeline, param_grid, cv=5)
Parameter Grid	Dictionary with step__param keys and lists of values	{'clf__C': [0.1, 1, 10]}
Step Naming	Use step names to reference parameters	'clf__C' for LogisticRegression's C parameter
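The grid is not limited to the model step. As a sketch of tuning all steps together, the example below also searches over the 'scaler' step itself by using the bare step name as a key. MinMaxScaler and max_iter=1000 are choices made for this illustration; 'passthrough' is the Pipeline placeholder that skips a step.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

param_grid = {
    # A bare step name replaces the whole step; 'passthrough' skips it
    'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
    'clf__C': [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)

# 3 scaler options x 3 values of C = 9 candidates
print(len(grid_search.cv_results_['params']))
print(grid_search.best_params_)
```

Every extra list multiplies the number of candidates, so keep each list short when you combine several steps in one grid.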

✅

Key Takeaways

Use sklearn Pipeline to combine preprocessing and model steps for clean workflows.

Pass the pipeline to GridSearchCV with a param_grid using step names and double underscores.

Give pipeline steps short, clear names, because those names become the prefixes of the keys in the grid search.

Set necessary model parameters like max_iter to avoid warnings during training.

Check parameter names carefully to avoid errors in GridSearchCV.