Stacking and blending combine several simple models into one stronger model. This often gives better predictions than any single model alone.
Stacking and Blending in Machine Learning (Python)
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

# Define base models
base_models = [
    ('rf', RandomForestClassifier()),
    ('lr', LogisticRegression(max_iter=1000))
]

# Define stacking model: a final estimator learns from base-model predictions
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000)
)

# Fit model (X_train, y_train assumed to be defined)
stacking_model.fit(X_train, y_train)

# Predict
predictions = stacking_model.predict(X_test)
```
Stacking uses base models to make predictions, then a final model learns from these predictions.
Blending is similar but uses a holdout set to train the final model instead of cross-validation.
```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

base_models = [
    ('dt', DecisionTreeClassifier(max_depth=3)),
    ('lr', LogisticRegression(max_iter=1000))
]

stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000)
)

stacking.fit(X_train, y_train)
predictions = stacking.predict(X_test)
```
```python
# Blending example (conceptual outline)
# 1. Split the training data into a main set and a holdout set
X_train_main, X_holdout, y_train_main, y_holdout = train_test_split(
    X_train, y_train, test_size=0.2
)
# 2. Train base models on X_train_main
# 3. Predict on X_holdout
# 4. Use the predictions on X_holdout as features to train the final model
# 5. At test time, feed base-model predictions on the test data to the final model
```
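The outline above can be made concrete. The following is a runnable sketch of blending on the iris data; the base-model choices here are illustrative, not prescriptive.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Carve a holdout set out of the training data
X_main, X_hold, y_main, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Train base models on the main split only
base_models = [RandomForestClassifier(random_state=42),
               LogisticRegression(max_iter=1000)]
for model in base_models:
    model.fit(X_main, y_main)

# Base-model probabilities on the holdout set become the final model's features
hold_feats = np.column_stack([m.predict_proba(X_hold) for m in base_models])
final_model = LogisticRegression(max_iter=1000).fit(hold_feats, y_hold)

# Apply the same transformation to the test set before predicting
test_feats = np.column_stack([m.predict_proba(X_test) for m in base_models])
y_pred = final_model.predict(test_feats)
print(f"Blending accuracy: {accuracy_score(y_test, y_pred):.2f}")
```

Note that the final model only ever sees predictions made on data the base models did not train on, which is the whole point of the holdout split.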
This program loads the iris flower dataset, splits it into training and test sets, trains two base models (a random forest and gradient boosting), then stacks them using logistic regression as the final model. It prints the accuracy on the test set.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define base models
base_models = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42))
]

# Define stacking model; cv=5 generates out-of-fold predictions
# for training the final estimator
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)

# Train stacking model
stacking_model.fit(X_train, y_train)

# Predict
y_pred = stacking_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking model accuracy: {accuracy:.2f}")
```
Stacking usually improves accuracy but can be slower to train because it trains multiple models.
Blending is simpler but sets aside part of the training data as a holdout set, so the base models see less data.
Common mistake: training the final model on predictions the base models made on their own training data. Without cross-validation (stacking) or a proper holdout set (blending), the final model learns from overconfident predictions and overfits.
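A minimal sketch of the safe approach, using scikit-learn's `cross_val_predict` on the iris data (model choices here are illustrative). Out-of-fold predictions like these are essentially what `StackingClassifier`'s `cv` parameter computes internally before fitting the final estimator.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(random_state=42)

# Safe: each row is predicted by a model fitted on the other folds,
# so the final model trains on honest, out-of-fold features
oof_probs = cross_val_predict(rf, X, y, cv=5, method="predict_proba")

# Risky: a forest fitted on all of X predicts its own training rows almost
# perfectly, so a final model trained on these features would overfit
rf.fit(X, y)
in_sample_probs = rf.predict_proba(X)

final_model = LogisticRegression(max_iter=1000).fit(oof_probs, y)
```

Comparing `oof_probs` with `in_sample_probs` makes the problem visible: the in-sample probabilities are nearly perfect and tell the final model almost nothing useful about generalization.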
Stacking and blending combine multiple models to make better predictions.
Stacking uses cross-validation to train a final model on base model predictions.
Blending uses a holdout set instead of cross-validation for the final model training.