Stacking and blending combine several models into one stronger model. The combined model often predicts better than any single model on its own.
Stacking and blending in ML Python
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier

# Assumes X_train, y_train and X_test already exist

# Define base models
base_models = [
    ('rf', RandomForestClassifier()),
    ('lr', LogisticRegression(max_iter=1000))
]

# Define stacking model
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000)
)

# Fit model
stacking_model.fit(X_train, y_train)

# Predict
predictions = stacking_model.predict(X_test)
```
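Stacking trains several base models, then a final model learns from their predictions. The final model is trained on cross-validated predictions, so each prediction it sees comes from a base model that was not trained on that row.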
Blending is similar but uses a holdout set to train the final model instead of cross-validation.
```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, y_train and X_test already exist
base_models = [
    ('dt', DecisionTreeClassifier(max_depth=3)),
    ('lr', LogisticRegression(max_iter=1000))
]

stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000)
)

stacking.fit(X_train, y_train)
predictions = stacking.predict(X_test)
```
```python
# Blending example (conceptual)
from sklearn.model_selection import train_test_split

# Split training data into train and holdout
X_train_main, X_holdout, y_train_main, y_holdout = train_test_split(
    X_train, y_train, test_size=0.2
)

# Train base models on X_train_main
# Predict on X_holdout
# Use predictions on X_holdout as features to train final model
# Predict on test data using base models and final model
```
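The conceptual steps above can be turned into a small runnable sketch. This is one possible way to blend, not a fixed recipe: the iris dataset, the choice of base models, the use of class probabilities as features, and the helper name `blend_features` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is used only as a small stand-in dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Split the training data into a main part and a holdout part.
X_train_main, X_holdout, y_train_main, y_holdout = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

# Train the base models on the main part only.
base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=42),
    RandomForestClassifier(random_state=42),
]
for model in base_models:
    model.fit(X_train_main, y_train_main)


def blend_features(models, X):
    """Hypothetical helper: place each model's class probabilities side by side."""
    return np.hstack([model.predict_proba(X) for model in models])


# Holdout predictions become the training features of the final model.
holdout_features = blend_features(base_models, X_holdout)
final_model = LogisticRegression(max_iter=1000)
final_model.fit(holdout_features, y_holdout)

# At test time the base models predict first, then the final model combines them.
blend_pred = final_model.predict(blend_features(base_models, X_test))
blend_accuracy = accuracy_score(y_test, blend_pred)
print(f"Blending accuracy: {blend_accuracy:.2f}")
```

The final model here only ever sees predictions for rows the base models were not trained on, which is the whole point of the holdout set.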
This program loads the iris flower data, splits it into training and test sets, trains two base models (random forest and gradient boosting), then stacks them using logistic regression as the final model. It prints the accuracy on the test set.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base models
base_models = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42))
]

# Define stacking model
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)

# Train stacking model
stacking_model.fit(X_train, y_train)

# Predict
y_pred = stacking_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking model accuracy: {accuracy:.2f}")
```
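To see whether stacking helps on a given dataset, it is worth scoring each base model on its own next to the stacked model. The sketch below does that with the same data split and models as the program above; the `scores` dictionary is just an illustrative name. On a dataset as small and easy as iris, the stacked model may not beat its base models.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

base_models = [
    ("rf", RandomForestClassifier(random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
]
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Score every base model alone, then the stacked model, on the same test set.
scores = {}
for name, model in base_models + [("stack", stacking_model)]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

If the stacked score is no higher than the best base model, the extra training time is probably not worth it for that dataset.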
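Stacking often improves accuracy, but it is slower to train because every base model is fitted several times (once per cross-validation fold) before the final model is trained.
Blending is simpler, but the base models never train on the holdout set, so they learn from less data.
Common mistake: Training the final model on predictions that the base models made for their own training data. Those predictions look better than they really are, so the final model overfits. Use cross-validation (stacking) or a holdout set (blending) to avoid this.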
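The cross-validation step can also be written out by hand, which makes it easier to see why it protects against overfitting. The following is a rough sketch of the idea using `cross_val_predict`, not a copy of what `StackingClassifier` does internally; the dataset, the base models, and the variable names `oof_features` and `test_features` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(random_state=42),
]

# Out-of-fold probabilities: every row is predicted by a model that never saw it.
oof_features = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    for model in base_models
])

# The final model learns from the out-of-fold features.
final_model = LogisticRegression(max_iter=1000).fit(oof_features, y_train)

# For test-time predictions, refit each base model on all of the training data.
for model in base_models:
    model.fit(X_train, y_train)
test_features = np.hstack([model.predict_proba(X_test) for model in base_models])

manual_accuracy = final_model.score(test_features, y_test)
print(f"Manual stacking accuracy: {manual_accuracy:.2f}")
```

Fitting the base models on `X_train` and then calling `predict_proba(X_train)` to build the final model's features would be the leaky version of this code.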
Stacking and blending combine multiple models to make better predictions.
Stacking uses cross-validation to train a final model on base model predictions.
Blending uses a holdout set instead of cross-validation for the final model training.
Practice
Solution
Step 1: Understand the purpose of stacking and blending
Stacking and blending are ensemble techniques that combine predictions from multiple models.
Step 2: Identify the goal of combining models
The goal is to improve prediction accuracy by leveraging strengths of different models.
Final Answer:
To combine multiple models to improve prediction accuracy -> Option A
Quick Check:
Stacking and blending = combine models for better accuracy [OK]
- Thinking stacking reduces dataset size
- Believing stacking replaces base models
- Confusing speed with accuracy improvement
Solution
Step 1: Recall stacking training method
Stacking trains the final model on predictions generated by base models using cross-validation.
Step 2: Compare options to stacking method
Only the option "Using cross-validation predictions from base models" mentions cross-validation predictions, which are key to stacking.
Final Answer:
Using cross-validation predictions from base models -> Option B
Quick Check:
Stacking uses cross-validation predictions [OK]
- Confusing stacking with blending's holdout set
- Thinking stacking uses entire data without splits
- Assuming random feature subsets are used
What is the shape of X_blend_train if X_train has shape (1000, 10) and holdout_ratio=0.2?
```python
from sklearn.model_selection import train_test_split

X_train_full, X_holdout, y_train_full, y_holdout = train_test_split(
    X_train, y_train, test_size=holdout_ratio, random_state=42
)

# Base model predictions on holdout
base_pred_holdout = base_model.predict(X_holdout)

# Blending training data
X_blend_train = base_pred_holdout.reshape(-1, 1)
```
Solution
Step 1: Calculate holdout set size
With 1000 samples and 0.2 holdout ratio, holdout size = 1000 * 0.2 = 200 samples.
Step 2: Determine shape of base model predictions
Base model predicts on holdout set, so predictions have shape (200,). Reshaping to (-1, 1) makes it (200, 1).
Final Answer:
(200, 1) -> Option A
Quick Check:
Holdout size 200, reshape to (200,1) [OK]
- Using full training size instead of holdout size
- Confusing reshape dimensions
- Assuming predictions keep original feature count
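The shapes in this solution can be checked directly. The sketch below fills in the pieces the question leaves out, so treat them as assumptions: random data with shape (1000, 10), random binary labels, and a logistic regression as `base_model`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: random data with the shape given in the question.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 10))
y_train = rng.integers(0, 2, size=1000)
holdout_ratio = 0.2

X_train_full, X_holdout, y_train_full, y_holdout = train_test_split(
    X_train, y_train, test_size=holdout_ratio, random_state=42
)

# Assumption: any classifier works as the base model for a shape check.
base_model = LogisticRegression(max_iter=1000).fit(X_train_full, y_train_full)

base_pred_holdout = base_model.predict(X_holdout)   # one label per holdout row
X_blend_train = base_pred_holdout.reshape(-1, 1)     # one column of predictions

print(X_holdout.shape, base_pred_holdout.shape, X_blend_train.shape)
# prints: (200, 10) (200,) (200, 1)
```

The ten original feature columns disappear because the blender sees only the base model's prediction, one value per holdout row.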
The code below raises ValueError: Found input variables with inconsistent numbers of samples. What is the likely cause?
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

base1 = LogisticRegression()
base2 = RandomForestClassifier()

# Cross-validated predictions from each base model
pred1 = cross_val_predict(base1, X_train, y_train, cv=5)
pred2 = cross_val_predict(base2, X_train, y_train, cv=5)

# One column of predictions per base model
X_meta = np.column_stack((pred1, pred2))

meta_model = LogisticRegression()
meta_model.fit(X_meta, y_train)
```
Solution
Step 1: Understand cross_val_predict output
cross_val_predict returns one prediction for each sample in X_train, so pred1 and pred2 should each have the same length as X_train.
Step 2: Identify cause of inconsistent sample sizes
If the predictions do not have the same length as y_train, X_meta and y_train end up with different numbers of rows, and meta_model.fit raises this ValueError.
Final Answer:
Base model predictions have different lengths than y_train -> Option D
Quick Check:
Prediction length mismatch causes ValueError [OK]
- Assuming models must be pre-fitted before cross_val_predict
- Thinking cv=5 is invalid for cross_val_predict
- Believing meta model type causes this error
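The error is easy to reproduce once the meta-features and the labels have different row counts. In the sketch below the mismatch is created on purpose by dropping ten labels; the iris data and the variable `error_message` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X_train, y_train = load_iris(return_X_y=True)

pred1 = cross_val_predict(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
pred2 = cross_val_predict(RandomForestClassifier(random_state=42), X_train, y_train, cv=5)
X_meta = np.column_stack((pred1, pred2))

# Healthy case: one row of meta-features per label, so fitting works.
assert X_meta.shape[0] == y_train.shape[0]
meta_model = LogisticRegression(max_iter=1000).fit(X_meta, y_train)

# Broken case (deliberate): ten labels are missing, so the row counts differ.
error_message = ""
try:
    LogisticRegression(max_iter=1000).fit(X_meta, y_train[:-10])
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Comparing `X_meta.shape[0]` with `len(y_train)` before fitting the meta model is a cheap way to catch this early.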
Solution
Step 1: Understand blending process
Blending trains the base models on the training data that remains after the holdout set is split off, then uses their predictions on that separate holdout set to train the blender model.
Step 2: Evaluate options against blending steps
Only the option "Train base models on full training data, predict on holdout, then train blender on holdout predictions" matches these steps: train the base models, predict on the holdout set, and train the blender on those predictions.
Final Answer:
Train base models on full training data, predict on holdout, then train blender on holdout predictions -> Option C
Quick Check:
Blending uses holdout predictions for blender training [OK]
- Training base models on holdout instead of full data
- Training blender without holdout predictions
- Ignoring holdout set in blending
