We use a pipeline with GridSearchCV to tune preprocessing steps and model settings together in a single search. Because each candidate is evaluated as one unit, the preprocessing is refit inside every cross-validation fold, which avoids common mistakes such as data leakage.
Pipeline with GridSearchCV in ML Python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('step_name', transformer_or_model),
    ('model', estimator)
])

param_grid = {
    'step_name__parameter': [values],
    'model__parameter': [values]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=number_of_folds)
grid_search.fit(X_train, y_train)
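Use a double underscore (__) between the step name and the parameter name, as in step_name__parameter, to set a parameter of a step inside the pipeline.
cv sets the number of cross-validation folds: the training data is split into that many parts, and each part is used once for validation while the model is fit on the rest.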
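One way to check which names are valid in a parameter grid is to ask the pipeline itself. The sketch below assumes a two-step pipeline whose step names, `scaler` and `clf`, are arbitrary labels chosen for illustration; `get_params()` lists every tunable name in `step__parameter` form, and `set_params()` accepts the same names.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative pipeline; the step names 'scaler' and 'clf' are arbitrary labels.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Names containing '__' address a parameter of one specific step.
nested_names = sorted(name for name in pipeline.get_params() if '__' in name)
print(nested_names)

# The same naming scheme works when setting a value directly.
pipeline.set_params(clf__C=10)
print(pipeline.named_steps['clf'].C)
```

A misspelled key in `param_grid` makes the search raise an error when it is fit, so printing the valid names first can save a debugging round.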
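# Example: scaling followed by logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1, 10]
}

# Example: dimensionality reduction followed by an SVM
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('pca', PCA()),
    ('svc', SVC())
])
param_grid = {
    'pca__n_components': [2, 3, 4],
    'svc__kernel': ['linear', 'rbf']
}

This program loads the iris flower data, splits it into training and test sets, and creates a pipeline that scales the data and trains an SVM model. It tries different scaling options and SVM settings to find the best combination. Finally, it prints the best settings and the model's accuracy on the test data.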
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

# Parameter grid
param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Best parameters
print('Best parameters:', grid_search.best_params_)

# Test accuracy
test_score = grid_search.score(X_test, y_test)
print(f'Test accuracy: {test_score:.2f}')
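Beyond the best parameters, the fitted search object keeps a record of every candidate it tried. The following is a minimal sketch that repeats the setup of the program above and then reads `cv_results_`, `best_score_`, and `best_estimator_`; printing only the top three candidates is an arbitrary choice made for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

results = grid_search.cv_results_
n_candidates = len(results['params'])  # 2 * 3 * 2 = 12 combinations
print('Candidates tried:', n_candidates)
print(f'Best mean CV accuracy: {grid_search.best_score_:.2f}')

# List the three best-ranked candidates with their mean validation score.
order = results['rank_test_score'].argsort()
for i in order[:3]:
    print(results['params'][i], round(results['mean_test_score'][i], 3))

# best_estimator_ is the whole pipeline, refit on all of X_train.
best_pipeline = grid_search.best_estimator_
print(type(best_pipeline).__name__)
```

Comparing `best_score_` (the mean validation accuracy) with the test accuracy gives a rough sense of whether the tuned settings generalize.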
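Put preprocessing inside the pipeline so that it is refit on the training portion of each fold. Fitting a scaler on the full dataset before cross-validation leaks information from the validation data.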
GridSearchCV tries every combination, so the number of fits is the product of the list lengths times the number of folds. Keep parameter lists small to save time.
You can add more preprocessing steps before the model in the pipeline.
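The notes above on grid size and extra steps can be sketched together. This example assumes a hypothetical three-step pipeline that inserts PCA between the scaler and the SVM, then uses `ParameterGrid` to count the combinations before any fitting happens; the step names and value lists are illustrative choices, not a recommendation.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# An extra preprocessing step ('pca') sits between the scaler and the model.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svc', SVC()),
])

param_grid = {
    'pca__n_components': [2, 3],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}

n_folds = 3
n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 2 combinations
n_fits = n_candidates * n_folds  # plus one final refit on the training set
print(n_candidates, n_fits)
```

Counting first makes the cost visible: adding one more list of three values would triple the number of fits.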
Pipelines combine data steps and models into one object.
GridSearchCV finds the best settings by testing many options.
Use double underscores to set parameters inside pipeline steps.
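To make the leakage point concrete, here is a small sketch that compares a scaler fit on all rows with one fit on the training rows of a single fold. The 3-fold `KFold` setup with `random_state=0` is an assumption chosen for illustration; any split would show the same effect.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Leaky: the scaler sees every row, including rows later used for validation.
leaky_mean = StandardScaler().fit(X).mean_

# Leak-free: the statistics come from the training rows of one fold only.
train_idx, valid_idx = next(cv.split(X))
fold_mean = StandardScaler().fit(X[train_idx]).mean_

print('Means differ:', not np.allclose(leaky_mean, fold_mean))

# Passing the pipeline to cross-validation repeats the leak-free fit per fold.
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
scores = cross_val_score(pipeline, X, y, cv=cv)
print(scores.shape)
```

The two sets of means differ because the first one includes the validation rows, which is exactly the information a pipeline keeps out of each fold.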