Bagging vs Boosting in Python: Key Differences and Code Examples
Bagging and boosting are ensemble methods that improve model accuracy by combining multiple models. Bagging builds models independently on random subsets of the data, while boosting builds models sequentially, each one focusing on correcting the previous models' errors.
Quick Comparison
Here is a quick side-by-side comparison of bagging and boosting methods.
| Factor | Bagging | Boosting |
|---|---|---|
| Model Building | Parallel, independent models | Sequential, dependent models |
| Data Sampling | Random subsets with replacement | Weighted samples focusing on errors |
| Goal | Reduce variance | Reduce bias and variance |
| Error Correction | No correction between models | Each model corrects previous errors |
| Common Algorithms | Random Forest | AdaBoost, Gradient Boosting |
| Risk of Overfitting | Lower risk | Higher risk if not tuned |
Key Differences
Bagging (Bootstrap Aggregating) creates multiple versions of a dataset by random sampling with replacement. It trains separate models independently on these samples and combines their predictions by voting or averaging. This approach mainly reduces variance and helps avoid overfitting.
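The bootstrap-and-vote idea above can be sketched in a few lines. This is an illustrative toy implementation, not how scikit-learn does it internally: it trains plain decision trees on resampled rows and combines them by majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

# Train each tree on an independent bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine predictions by majority vote across the ensemble
all_preds = np.array([t.predict(X) for t in trees])            # shape (n_trees, n_samples)
votes = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, all_preds)
print((votes == y).mean())  # training accuracy of the voted ensemble
```

Because each tree sees a different resampled dataset, their individual mistakes differ, and voting averages those mistakes away, which is exactly the variance reduction bagging is after.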
Boosting builds models one after another, where each new model tries to fix the mistakes of the previous ones. It assigns higher weights to misclassified samples so the next model focuses more on them. This sequential learning reduces both bias and variance but can overfit if not carefully controlled.
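The reweighting loop can be sketched directly. This is a simplified AdaBoost-style sketch on a binary version of Iris (class 2 vs the rest, an assumption made here so the weight formula stays simple), not the exact scikit-learn implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)   # binary task: class 2 vs the rest
n = len(X)
w = np.full(n, 1 / n)                # start with uniform sample weights

for _ in range(3):
    # Each round fits a weak learner (a depth-1 stump) on the current weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y            # which samples were misclassified
    err = w[miss].sum()                     # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)   # how much this weak learner counts
    # Up-weight the mistakes, down-weight correct predictions, renormalize
    w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
    w /= w.sum()
```

After each round, the misclassified samples carry more weight, so the next stump is pulled toward exactly the cases the ensemble still gets wrong.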
In scikit-learn, bagging is implemented with classes like BaggingClassifier and RandomForestClassifier, while boosting includes AdaBoostClassifier and GradientBoostingClassifier. The main difference is how models are trained and combined.
Code Comparison
Example of bagging using RandomForestClassifier on the Iris dataset.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# Bagging model
bagging_model = RandomForestClassifier(n_estimators=100, random_state=42)
bagging_model.fit(X_train, y_train)

# Predict and evaluate
predictions = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Bagging (Random Forest) Accuracy: {accuracy:.2f}")
```
Boosting Equivalent
Equivalent boosting example using AdaBoostClassifier on the same Iris dataset.
```python
from sklearn.ensemble import AdaBoostClassifier

# Boosting model
boosting_model = AdaBoostClassifier(n_estimators=100, random_state=42)
boosting_model.fit(X_train, y_train)

# Predict and evaluate
predictions_boost = boosting_model.predict(X_test)
accuracy_boost = accuracy_score(y_test, predictions_boost)
print(f"Boosting (AdaBoost) Accuracy: {accuracy_boost:.2f}")
```
When to Use Which
Choose bagging when you want to reduce variance and avoid overfitting, especially with unstable, high-variance models like deep decision trees. Bagging works well when individual models fit the training data closely but are sensitive to noise in it.
Choose boosting when you want to reduce bias and improve accuracy by focusing on hard-to-predict samples. Boosting is powerful but needs careful tuning to avoid overfitting.
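The two knobs that matter most for controlling boosting are `n_estimators` and `learning_rate`. A quick hedged sketch of how you might compare settings with cross-validation (the specific values here are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# A lower learning_rate shrinks each tree's contribution, trading more
# estimators for less risk of overfitting
for lr in (1.0, 0.1):
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, random_state=42
    )
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    print(f"learning_rate={lr}: mean CV accuracy {scores.mean():.2f}")
```

Cross-validated scores like these are a safer guide than training accuracy, since an over-tuned boosted model can fit the training set almost perfectly while generalizing poorly.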
In practice, start with bagging for simplicity and stability, then try boosting if you need better performance and can tune hyperparameters.