
Pipeline with GridSearchCV in ML Python

Introduction

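A pipeline combined with GridSearchCV lets you tune preprocessing settings and model settings in a single search. Because the whole pipeline is refit inside each cross-validation fold, the search finds a good combination of data preparation and model settings while avoiding common mistakes, such as fitting a scaler on data that is later used for validation.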

When you want to test different data cleaning or scaling methods together with model settings.
When you want to avoid repeating code for preprocessing before training.
When you want to find the best model settings automatically by trying many options.
When you want to keep your code clean and easy to understand.
When you want to tune the model with cross-validation so that it is more likely to work well on new data.
Syntax
ML Python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('step_name', transformer),  # every step except the last must be a transformer
    ('model', estimator)         # the last step is usually the model being tuned
])

param_grid = {
    'step_name__parameter': [values],
    'model__parameter': [values]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=number_of_folds)
grid_search.fit(X_train, y_train)

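Use a double underscore (__) between the step name and the parameter name to set a parameter of a step inside the pipeline, as in 'model__parameter' above.

cv sets the number of cross-validation folds: the training data is split into that many parts, and each part is used once for validation while the pipeline is trained on the rest.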
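
If you are unsure which names are valid keys for param_grid, one way to check is to list the pipeline's parameters. The sketch below assumes a two-step pipeline whose steps are named scaler and svc (names chosen here only for illustration). get_params() and set_params() are standard scikit-learn estimator methods.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative pipeline; the step names 'scaler' and 'svc' are assumptions
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

# Nested parameters follow the pattern <step_name>__<parameter>
tunable = sorted(name for name in pipeline.get_params() if '__' in name)
print(tunable)

# set_params accepts the same names that param_grid uses as keys
pipeline.set_params(svc__C=10, scaler__with_mean=False)
print(pipeline.named_steps['svc'].C)
```

Any name printed in the list can be used as a key in param_grid. A misspelled step or parameter name makes GridSearchCV raise an error when fit is called.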

Examples
This example tries scaling with or without centering and different regularization strengths for logistic regression.
ML Python
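from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Keys are <step_name>__<parameter>
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1, 10]
}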
This example tries different numbers of PCA components and SVM kernels.
ML Python
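from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Reduce dimensionality, then classify
pipeline = Pipeline([
    ('pca', PCA()),
    ('svc', SVC())
])

# n_components cannot exceed the number of input features
param_grid = {
    'pca__n_components': [2, 3, 4],
    'svc__kernel': ['linear', 'rbf']
}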
Sample Model

This program loads the iris flower data, splits it, and creates a pipeline that scales data and trains an SVM model. It tries different scaling options and SVM settings to find the best combination. Finally, it prints the best settings and how well the model works on test data.

ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

# Parameter grid
param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Best parameters
print('Best parameters:', grid_search.best_params_)

# Test accuracy
test_score = grid_search.score(X_test, y_test)
print(f'Test accuracy: {test_score:.2f}')
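
Beyond the single best setting, it can help to see how the other candidates scored. The following is a minimal sketch that repeats the sample program's setup and then reads the fitted search's cv_results_ attribute. Showing only the top three candidates is an arbitrary choice made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same setup as the sample program
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# cv_results_ holds one entry per parameter combination
results = grid_search.cv_results_
order = results['rank_test_score'].argsort()  # best-ranked candidates first

for i in order[:3]:
    mean = results['mean_test_score'][i]
    std = results['std_test_score'][i]
    print(f"{results['params'][i]}  mean={mean:.3f}  std={std:.3f}")

# Mean cross-validated score of the best candidate
print('Best CV score:', round(grid_search.best_score_, 3))
```

The cross-validated score comes from the training data only, so it can differ from the test accuracy printed by the sample program.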
Important Notes

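Put preprocessing inside the pipeline to avoid data leakage during cross-validation: each step is then fit only on the training part of every fold, never on the validation part.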

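GridSearchCV tries every combination in the grid and fits each one once per fold, so keep parameter lists small to save time. The sample program has 2 × 3 × 2 = 12 combinations, which means 36 fits with cv=3.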

You can add more preprocessing steps before the model in the pipeline.
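
The cost of a grid can be estimated before running the search. As a rough sketch, scikit-learn's ParameterGrid expands a grid into its individual combinations, so its length gives the number of candidates. The grid and fold count below are copied from the sample program; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the sample program
param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf']
}
cv_folds = 3

# One candidate per combination of values: 2 * 3 * 2
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fit once per fold during the search
n_search_fits = n_candidates * cv_folds

print(n_candidates)   # 12
print(n_search_fits)  # 36
```

With the default refit=True, GridSearchCV performs one more fit at the end to train the best candidate on all of the training data.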

Summary

Pipelines combine preprocessing steps and a model into one object.

GridSearchCV finds the best settings by testing many options.

Use double underscores to set parameters inside pipeline steps.