Bagging vs Boosting in Python: Key Differences and Code Examples
Bagging and boosting are ensemble methods that improve model accuracy by combining multiple models. Bagging builds models independently on random subsets of the data, while boosting builds models sequentially, each one focusing on correcting the previous models' errors.
Quick Comparison
Here is a quick side-by-side comparison of bagging and boosting methods.
| Factor | Bagging | Boosting |
|---|---|---|
| Model Building | Parallel, independent models | Sequential, dependent models |
| Data Sampling | Random subsets with replacement | Weighted samples focusing on errors |
| Goal | Reduce variance | Reduce bias and variance |
| Error Correction | No correction between models | Each model corrects previous errors |
| Common Algorithms | Random Forest | AdaBoost, Gradient Boosting |
| Risk of Overfitting | Lower risk | Higher risk if not tuned |
Key Differences
Bagging (Bootstrap Aggregating) creates multiple versions of a dataset by random sampling with replacement. It trains separate models independently on these samples and combines their predictions by voting or averaging. This approach mainly reduces variance and helps avoid overfitting.
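The bootstrap-and-vote idea above can be sketched in a few lines. This is an illustrative toy implementation, not how scikit-learn does it internally: it trains plain decision trees on resampled rows and combines them by majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

# Train each tree on an independent bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine predictions by majority vote across the ensemble
all_preds = np.array([t.predict(X) for t in trees])            # shape (n_trees, n_samples)
votes = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, all_preds)
print((votes == y).mean())  # training accuracy of the voted ensemble
```

Because each tree sees a different resampled dataset, their individual mistakes differ, and voting averages those mistakes away, which is exactly the variance reduction bagging is after.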
Boosting builds models one after another, where each new model tries to fix the mistakes of the previous ones. It assigns higher weights to misclassified samples so the next model focuses more on them. This sequential learning reduces both bias and variance but can overfit if not carefully controlled.
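The reweighting loop can be sketched directly. This is a simplified AdaBoost-style sketch on a binary version of Iris (class 2 vs the rest, an assumption made here so the weight formula stays simple), not the exact scikit-learn implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)   # binary task: class 2 vs the rest
n = len(X)
w = np.full(n, 1 / n)                # start with uniform sample weights

for _ in range(3):
    # Each round fits a weak learner (a depth-1 stump) on the current weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y            # which samples were misclassified
    err = w[miss].sum()                     # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)   # how much this weak learner counts
    # Up-weight the mistakes, down-weight correct predictions, renormalize
    w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
    w /= w.sum()
```

After each round, the misclassified samples carry more weight, so the next stump is pulled toward exactly the cases the ensemble still gets wrong.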
In scikit-learn, bagging is implemented with classes like BaggingClassifier and RandomForestClassifier, while boosting includes AdaBoostClassifier and GradientBoostingClassifier. The main difference is how models are trained and combined.
Code Comparison
Example of bagging using RandomForestClassifier on the Iris dataset.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# Bagging model
bagging_model = RandomForestClassifier(n_estimators=100, random_state=42)
bagging_model.fit(X_train, y_train)

# Predict and evaluate
predictions = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Bagging (Random Forest) Accuracy: {accuracy:.2f}")
```
Boosting Equivalent
Equivalent boosting example using AdaBoostClassifier on the same Iris dataset.
```python
from sklearn.ensemble import AdaBoostClassifier

# Boosting model
boosting_model = AdaBoostClassifier(n_estimators=100, random_state=42)
boosting_model.fit(X_train, y_train)

# Predict and evaluate
predictions_boost = boosting_model.predict(X_test)
accuracy_boost = accuracy_score(y_test, predictions_boost)
print(f"Boosting (AdaBoost) Accuracy: {accuracy_boost:.2f}")
```
When to Use Which
Choose bagging when you want to reduce variance and avoid overfitting, especially with unstable, high-variance models like deep decision trees. Bagging works well when individual models fit the training data closely but are sensitive to noise in it.
Choose boosting when you want to reduce bias and improve accuracy by focusing on hard-to-predict samples. Boosting is powerful but needs careful tuning to avoid overfitting.
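The two knobs that matter most for controlling boosting are `n_estimators` and `learning_rate`. A quick hedged sketch of how you might compare settings with cross-validation (the specific values here are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# A lower learning_rate shrinks each tree's contribution, trading more
# estimators for less risk of overfitting
for lr in (1.0, 0.1):
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, random_state=42
    )
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    print(f"learning_rate={lr}: mean CV accuracy {scores.mean():.2f}")
```

Cross-validated scores like these are a safer guide than training accuracy, since an over-tuned boosted model can fit the training set almost perfectly while generalizing poorly.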
In practice, start with bagging for simplicity and stability, then try boosting if you need better performance and can tune hyperparameters.