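We use a pipeline with GridSearchCV to try many settings for a model and its preprocessing steps in one search. Because the preprocessing is refit inside every cross-validation split, this finds the best way to prepare the data and train the model without leaking validation data into the fit.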
Pipeline with GridSearchCV in ML Python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('step_name', transformer_or_model),
    ('model', estimator)
])

param_grid = {
    'step_name__parameter': [values],
    'model__parameter': [values]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=number_of_folds)
grid_search.fit(X_train, y_train)
Use double underscores __ to set parameters for steps inside the pipeline.
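cv sets the number of cross-validation folds: the training data is split into that many parts, and each part is used once for validation while the rest is used for fitting.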
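One way to check which names a parameter grid may use is to ask the pipeline itself. The sketch below assumes a small two-step pipeline whose step names ('scaler' and 'clf') are illustrative choices; get_params() lists every valid step__parameter key, and set_params() accepts the same syntax.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative pipeline; the step names 'scaler' and 'clf' are arbitrary.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Every key containing '__' is a parameter of one step and can be used in param_grid.
nested_keys = sorted(k for k in pipeline.get_params() if '__' in k)
print(nested_keys)

# set_params uses the same step__parameter syntax.
pipeline.set_params(clf__C=10, scaler__with_mean=False)
print(pipeline.named_steps['clf'].C)
print(pipeline.named_steps['scaler'].with_mean)
```

Printing the keys before writing a grid is a quick guard against typos in step or parameter names.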
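from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then fit a logistic regression.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
param_grid = {
    'scaler__with_mean': [True, False],
    'clf__C': [0.1, 1, 10]
}

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Reduce dimensionality with PCA, then fit a support vector classifier.
pipeline = Pipeline([
    ('pca', PCA()),
    ('svc', SVC())
])
param_grid = {
    'pca__n_components': [2, 3, 4],
    'svc__kernel': ['linear', 'rbf']
}

This program loads the iris flower data, splits it into training and test sets, and creates a pipeline that scales the data and trains an SVM model. It tries different scaling options and SVM settings to find the best combination. Finally, it prints the best settings and the accuracy on the held-out test data.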
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

# Parameter grid
param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Best parameters
print('Best parameters:', grid_search.best_params_)

# Test accuracy
test_score = grid_search.score(X_test, y_test)
print(f'Test accuracy: {test_score:.2f}')
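Once a search like the one above has been fitted, it can be useful to look beyond best_params_. The following sketch repeats the same iris setup so that it runs on its own, then reads best_score_, cv_results_, and best_estimator_; how many rows to print is an arbitrary choice made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Mean cross-validated accuracy of the winning combination.
print(f'Best CV accuracy: {grid_search.best_score_:.3f}')

# cv_results_ holds one entry per combination; show the five best-ranked ones.
results = grid_search.cv_results_
order = results['rank_test_score'].argsort()
for i in order[:5]:
    print(f"{results['mean_test_score'][i]:.3f}  {results['params'][i]}")

# best_estimator_ is the whole pipeline, refit on all of X_train.
best_pipeline = grid_search.best_estimator_
print(best_pipeline.predict(X_test[:5]))
```

Because best_estimator_ is a full pipeline, it applies the fitted scaler before predicting, so raw test rows can be passed in directly.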
Always put preprocessing inside the pipeline: it is then fit on the training part of each fold only, which avoids data leakage during cross-validation.
GridSearchCV tries all combinations, so keep parameter lists small to save time.
You can add more preprocessing steps before the model in the pipeline.
Pipelines combine data steps and models into one object.
GridSearchCV finds the best settings by cross-validating every combination in the parameter grid.
Use double underscores to set parameters inside pipeline steps.
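To see why small parameter lists matter, the number of model fits can be counted before running anything. This sketch uses scikit-learn's ParameterGrid with the grid from the iris program; the variable names cv_folds, n_combinations, and n_fits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'scaler__with_mean': [True, False],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}
cv_folds = 3

# Number of combinations is the product of the list lengths: 2 * 3 * 2.
n_combinations = len(ParameterGrid(param_grid))
# Each combination is fitted once per fold.
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)  # prints: 12 36
```

With the default refit=True, GridSearchCV also performs one more fit of the best combination on the full training data.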
Practice
What is the main purpose of a Pipeline in machine learning?
Solution
Step 1: Understand what a Pipeline does
A Pipeline chains preprocessing and model training steps so they run in sequence as a single estimator.
Step 2: Identify the main benefit
This chaining helps avoid mistakes and makes code cleaner by combining steps into one object.
Final Answer:
To combine preprocessing steps and model training into one object -> Option A
Quick Check:
Pipeline = combine steps [OK]
Common Mistakes:
- Thinking Pipeline speeds up training automatically
- Confusing Pipeline with model selection
- Believing Pipeline creates visualizations
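The "one object" idea can be illustrated by comparing a hand-written sequence of steps with the equivalent pipeline. This is a minimal sketch on the iris data; the choice of LogisticRegression, max_iter=1000, and the split settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Manual version: every step is fitted and applied by hand.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Pipeline version: one object, one fit call, one predict call.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both versions perform the same computations, so the predictions should agree.
print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline does not change what is computed; it only packages the steps so that tools such as GridSearchCV can treat them as a single estimator.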
Which param_grid correctly sets n_estimators of a RandomForest inside a pipeline named pipe for GridSearchCV?
Solution
Step 1: Recall parameter naming in Pipeline
Parameters inside a pipeline step use double underscores: stepname__paramname.
Step 2: Match step name and parameter
If the step is named 'randomforest', then 'randomforest__n_estimators' is the correct syntax.
Final Answer:
{'randomforest__n_estimators': [10, 50, 100]} -> Option D
Quick Check:
Use double underscores between step and param [OK]
Common Mistakes:
- Using single underscore instead of double
- Using dot or dash instead of double underscore
- Misspelling the pipeline step name
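The naming rule can be checked directly against a pipeline. The sketch below assumes a pipeline named pipe with a step called 'randomforest', as in the solution, and tests three candidate keys against the names reported by get_params(); the candidate list is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The step name 'randomforest' follows the solution above.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('randomforest', RandomForestClassifier(random_state=0)),
])

valid_keys = pipe.get_params().keys()
candidates = [
    'randomforest__n_estimators',  # double underscore
    'randomforest_n_estimators',   # single underscore
    'randomforest.n_estimators',   # dot
]
is_valid = {key: key in valid_keys for key in candidates}
for key, ok in is_valid.items():
    print(f'{key!r}: {ok}')
```

Only the double-underscore form appears among the pipeline's parameter names, which is why it is the only form a parameter grid can use.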
What is the grid.best_params_ output of the following code?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])
param_grid = {'clf__n_estimators': [20], 'clf__max_depth': [4]}
grid = GridSearchCV(pipe, param_grid, cv=2)
# X_train and y_train are assumed to be defined already
grid.fit(X_train, y_train)
print(grid.best_params_)
Solution
Step 1: Understand pipeline and param_grid
The pipeline has a step named 'clf' for RandomForestClassifier. The param_grid uses the 'clf__' prefix correctly.
Step 2: Determine the output
Since param_grid specifies only one combination, GridSearchCV will select {'clf__n_estimators': 20, 'clf__max_depth': 4} as the best parameters.
Final Answer:
{'clf__n_estimators': 20, 'clf__max_depth': 4} -> Option C
Quick Check:
Best params match the only tested values [OK]
Common Mistakes:
- Confusing step name 'clf' with 'classifier'
- Using single underscore in param_grid keys
- Assuming syntax error without checking keys
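The answer can be confirmed by running the question's code on real data. The question leaves X_train and y_train undefined, so this sketch assumes the iris dataset and a default train/test split purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed data: the question does not say where X_train and y_train come from.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])
param_grid = {'clf__n_estimators': [20], 'clf__max_depth': [4]}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(X_train, y_train)

# With a single candidate, best_params_ can only be that candidate.
print(grid.best_params_)
print(len(grid.cv_results_['params']))
```

The dictionary holds the same two entries regardless of the data, because the grid contains exactly one combination.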
What is wrong with the following code?
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
param_grid = {'randomforest__n_estimators': [10, 50]}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)
Solution
Step 1: Check pipeline step names
The pipeline step for RandomForestClassifier is named 'model', not 'randomforest'.
Step 2: Match param_grid keys to pipeline steps
Parameter keys must use the step name 'model' with double underscores, so 'model__n_estimators' is correct.
Final Answer:
The param_grid key should be 'model__n_estimators', not 'randomforest__n_estimators' -> Option A
Quick Check:
Param keys must match pipeline step names [OK]
Common Mistakes:
- Using wrong step name in param_grid keys
- Thinking RandomForest can't be in pipeline
- Believing cv is mandatory (it defaults to 5)
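The mismatch can be reproduced without running a full search, since grid keys are applied to the pipeline through set_params. The sketch below calls set_params directly on a pipeline shaped like the one in the question; the variable error_message is a name introduced here for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=0)),
])

# Wrong prefix: there is no step called 'randomforest', so a ValueError is raised.
try:
    pipe.set_params(randomforest__n_estimators=10)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Correct prefix: the key starts with the actual step name 'model'.
pipe.set_params(model__n_estimators=10)
print(pipe.named_steps['model'].n_estimators)
```

The error text names the invalid parameter, which usually makes a wrong step name easy to spot.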
Which param_grid tests the pipeline below with and without scaling, using 10 and 50 trees?
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0))
])
param_grid = ?
Solution
Step 1: Understand how to toggle scaler on/off in pipeline
To test with and without scaling, list both StandardScaler() and None under the step name 'scaler' in param_grid, so the search replaces the whole step; None disables it.
Step 2: Set classifier parameters correctly
Use 'clf__n_estimators' to test 10 and 50 trees for the RandomForestClassifier step named 'clf'.
Final Answer:
{'scaler': [StandardScaler(), None], 'clf__n_estimators': [10, 50]} -> Option B
Quick Check:
Toggle scaler by replacing step, tune clf params with double underscores [OK]
Common Mistakes:
- Trying to set scaler params with double underscores incorrectly
- Using 'scaler__' key with no param name
- Not using None to disable a step
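The grid from the final answer can be run as a quick check that a whole step may be swapped during the search. This sketch assumes the iris data and cv=3, neither of which is given in the question; scikit-learn also accepts the string 'passthrough' in place of None.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed data: the question does not name a dataset.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])
param_grid = {
    'scaler': [StandardScaler(), None],  # None skips the scaling step
    'clf__n_estimators': [10, 50],
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

# Two scaler choices times two forest sizes gives four candidates.
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(f"{score:.3f}  {params}")
```

A key without double underscores replaces the step itself, while a key with double underscores sets a parameter inside the step.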
