Bagging reduces error by combining many models trained on different random samples of the data. Its main effect is to lower variance, which usually improves accuracy on unseen data. Accuracy and error rate, compared against a single base model, are therefore the key metrics for checking whether bagging helps. For classification, accuracy, precision, and recall show how well the combined model predicts. For regression, mean squared error (MSE) or mean absolute error (MAE) measures how close predictions are to the true values.
Bagging concept in ML Python - Model Metrics & Evaluation
Imagine a bagging model classifying emails as spam or not spam. Here is a confusion matrix from 100 emails:
|                 | Predicted Spam           | Predicted Not Spam        |
|-----------------|--------------------------|---------------------------|
| Actual Spam     | True Positives (TP) = 40 | False Negatives (FN) = 10 |
| Actual Not Spam | False Positives (FP) = 5 | True Negatives (TN) = 45  |
Totals: TP + FP + FN + TN = 40 + 5 + 10 + 45 = 100 emails.
From this, we calculate:
- Precision = TP / (TP + FP) = 40 / (40 + 5) = 0.89
- Recall = TP / (TP + FN) = 40 / (40 + 10) = 0.80
- Accuracy = (TP + TN) / Total = (40 + 45) / 100 = 0.85
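The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `classification_metrics` is made up for illustration, and it assumes the denominators are non-zero.

```python
# Counts taken from the spam confusion matrix above.
tp, fp, fn, tn = 40, 5, 10, 45


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts.

    Hypothetical helper for illustration; assumes tp + fp > 0 and tp + fn > 0.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When true and predicted labels are available instead of counts, scikit-learn's `accuracy_score`, `precision_score`, and `recall_score` compute the same quantities.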
Bagging does not favor recall or precision on its own. By averaging many models it makes predictions more stable, which often reduces both missed cases and false alarms; which metric moves more depends on the data and on the decision threshold.
For example, in medical tests, missing a sick patient (low recall) is worse than a false alarm (low precision). Lowering the decision threshold on the bagged model's averaged probabilities catches more sick patients, at the cost of more false alarms.
In spam detection, high precision is important to avoid marking good emails as spam. A bagged model can be tuned to balance this tradeoff by adjusting the threshold applied to its predicted probabilities.
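The threshold tradeoff described above can be sketched with scikit-learn. The data here is synthetic and the three thresholds are arbitrary choices for illustration, not recommended values. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data (an assumption for illustration only).
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Averaged probability of the positive class across the ensemble.
positive_proba = model.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive only when the averaged probability reaches the threshold.
    y_pred = (positive_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the threshold goes up. Precision typically rises, though that is not guaranteed.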
Good values:
- Test accuracy above 85%, and higher than a single base model's on the same split, suggests bagging improved predictions.
- Precision and recall both above 80% indicate balanced and reliable predictions.
- Lower error rates compared to a single model show bagging reduced variance.
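One way to check the last point is to score a single tree and a bagged ensemble of the same tree under identical cross-validation. The sketch below assumes scikit-learn's built-in breast cancer dataset, 5 folds, and 50 estimators. These are all illustrative choices, and the numbers it prints will differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
# Base estimator passed positionally; its keyword name depends on the scikit-learn version.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)

# Same folds and same metric for both models, so the scores are comparable.
tree_scores = cross_val_score(single_tree, X, y, cv=5, scoring="accuracy")
bag_scores = cross_val_score(bagged_trees, X, y, cv=5, scoring="accuracy")

print(f"single tree : mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"bagged trees: mean={bag_scores.mean():.3f}, std={bag_scores.std():.3f}")
```

A higher mean and a smaller spread across folds for the bagged model would be consistent with reduced variance.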
Bad values:
- Accuracy close to random guessing (e.g., 50% for two classes) means bagging did not help.
- Very high precision but very low recall means many true cases are missed.
- High error rates or unstable results on new data suggest overfitting or poor bagging setup.
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading if data is imbalanced. For example, if 95% of emails are not spam, a model that always predicts not spam gets 95% accuracy but is useless.
- Data leakage: If test data leaks into training, bagging looks better than it really is.
- Overfitting: Bagging reduces overfitting, but if the base models are too complex, the combined model may still overfit.
- Ignoring variance: Bagging mainly reduces variance, so metrics should be checked on new, unseen data, not just on training data.
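The accuracy paradox from the list above is easy to reproduce in plain Python. The 100-email inbox below is hypothetical and simply mirrors the 95% / 5% split in the example.

```python
# Hypothetical inbox: 95 legitimate emails (label 0) and 5 spam emails (label 1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts not spam.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the spam class: caught spam divided by all actual spam.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, spam recall={recall:.2f}")
```

The accuracy is high only because the majority class dominates; no spam email is ever caught.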
Your bagging model has 98% accuracy but only 12% recall on fraud cases. Is it good for production?
Answer: No, it is not good. Even though accuracy is high, the model misses 88% of fraud cases (low recall). For fraud detection, catching fraud (high recall) is critical. This model would let most fraud slip through.
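To make the answer concrete, here is one set of confusion-matrix counts that is consistent with 98% accuracy and 12% recall. The counts are invented for illustration (10,000 transactions, 200 of them fraudulent); the question itself does not give them.

```python
# Hypothetical counts chosen to match the question: 10,000 transactions, 200 fraud.
tp, fn = 24, 176     # fraud caught / fraud missed
fp, tn = 24, 9776    # legitimate flagged / legitimate passed

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2%}, recall={recall:.2%}, missed fraud cases={fn}")
```

Under these assumed counts, 176 of the 200 fraud cases pass undetected even though accuracy is 98%.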
Practice
bagging in machine learning?
Solution
Step 1: Understand bagging concept
Bagging stands for Bootstrap Aggregating, which means training many models on different random samples of the data (bootstrap samples, drawn with replacement).
Step 2: Identify the purpose of bagging
It combines the results of these models to make predictions more stable and accurate.
Final Answer:
Training multiple models on random samples and combining their results -> Option A
Quick Check:
Bagging = multiple models + random samples + combine results [OK]
- Thinking bagging uses only one model
- Confusing bagging with feature selection
- Believing bagging increases model complexity by depth
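The two steps in the solution above, bootstrap then aggregate, can be written out by hand. This is a simplified sketch using hard majority voting on the iris data, with 10 models as an arbitrary choice. It is not how scikit-learn's `BaggingClassifier` is implemented internally (that class averages predicted probabilities when the base estimator provides them).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_models = 10  # illustrative choice

# Bootstrap: each model is trained on rows drawn with replacement.
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: collect every model's vote and keep the most common label per row.
votes = np.stack([m.predict(X) for m in models])  # shape (n_models, n_samples)
n_classes = len(np.unique(y))
majority = np.array(
    [np.bincount(column, minlength=n_classes).argmax() for column in votes.T]
)

print("ensemble agreement with labels:", (majority == y).mean())
```

Because the agreement is measured on the same rows used for training, it is an optimistic estimate and not a test score.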
Solution
Step 1: Recall scikit-learn bagging syntax
The correct class is BaggingClassifier, and it takes base_estimator and n_estimators as parameters. (scikit-learn 1.2 renamed base_estimator to estimator, and 1.4 removed the old name, so on current versions write estimator=DecisionTreeClassifier().)
Step 2: Match parameters to options
BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10) uses base_estimator=DecisionTreeClassifier() and n_estimators=10, which is correct syntax.
Final Answer:
BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10) -> Option B
Quick Check:
BaggingClassifier + base_estimator + n_estimators = Option B [OK]
- Using wrong parameter names like 'base' or 'estimators'
- Confusing BaggingClassifier with Bagging
- Passing parameters in wrong order or with wrong names
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import BaggingClassifier

    iris = load_iris()
    X, y = iris.data, iris.target

    # scikit-learn >= 1.2 names this parameter `estimator`; older releases used `base_estimator`.
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=5, random_state=42)
    bagging.fit(X, y)
    predictions = bagging.predict(X)

    # Compare each prediction with its true label and add up the matches.
    print(sum(predictions == y))

What does the printed number represent?
Solution
Step 1: Understand the code output
The code prints sum(predictions == y), which counts how many predicted labels match the true labels.
Step 2: Interpret the printed value meaning
This count is the number of correct predictions on the training data.
Final Answer:
Number of correct predictions on the training data -> Option A
Quick Check:
sum(predictions == y) = correct predictions [OK]
- Thinking it counts incorrect predictions
- Confusing it with dataset size
- Assuming it prints number of trees
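Since the count in this exercise is measured on the training data, it says little about unseen data. A hedged sketch of two alternatives follows: a held-out test split and the out-of-bag (OOB) estimate that `BaggingClassifier` exposes when `oob_score=True`. The 30% test size and 25 estimators are illustrative choices, not values from the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Base estimator passed positionally; its keyword name depends on the scikit-learn version.
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=25,
    oob_score=True,  # score each training row using only trees that did not see it
    random_state=42,
)
bagging.fit(X_train, y_train)

train_correct = int((bagging.predict(X_train) == y_train).sum())
test_correct = int((bagging.predict(X_test) == y_test).sum())

print(f"correct on training data: {train_correct} / {len(y_train)}")
print(f"correct on test data    : {test_correct} / {len(y_test)}")
print(f"out-of-bag accuracy     : {bagging.oob_score_:.3f}")
```

The test count and the OOB accuracy are the figures to compare against a single model; the training count is usually the most optimistic of the three.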
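    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # X_train and y_train are assumed to be defined earlier.
    # scikit-learn >= 1.2 names the first parameter `estimator`; older releases used `base_estimator`.
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators='10')
    bagging.fit(X_train, y_train)

What is the likely cause of the error?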
Solution
Step 1: Check parameter types
n_estimators expects an integer number of models, but '10' is a string.
Step 2: Identify error cause
Passing a string instead of an int makes scikit-learn raise an error when the model is fit (the exact exception type depends on the library version).
Final Answer:
n_estimators should be an integer, not a string -> Option C
Quick Check:
n_estimators must be int, not str [OK]
- Passing n_estimators as string instead of int
- Forgetting to import DecisionTreeClassifier
- Thinking base_estimator must be string
Solution
Step 1: Understand bagging effect on overfitting
Bagging reduces overfitting by training many models on random samples and averaging their results, which lowers variance.
Step 2: Choose model depth and sampling
Shallow trees overfit less individually, and random sampling adds diversity between trees, improving stability and accuracy. (Depth is still a tuning choice: trees that are too shallow can underfit, and averaging does not fix that.)
Final Answer:
Use many shallow decision trees trained on random samples and combine their votes -> Option D
Quick Check:
Bagging + shallow trees + random samples = less overfitting [OK]
- Using one deep tree causes overfitting
- Training many deep trees on full data lacks diversity
- Ignoring bagging and using single tree
