Bagging is a technique used in machine learning. What is its main goal?
Think about how bagging uses multiple models and what problem it tries to solve.
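Bagging stands for Bootstrap Aggregating. It reduces variance by training multiple models on bootstrap samples of the training data (random samples drawn with replacement) and then combining their predictions, by averaging for regression or by voting for classification, to get a more stable result.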
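The idea can be sketched with the standard library alone. The toy data, the 1-nearest-neighbour base learner, and the helper names `fit_1nn` and `bagged_models` are assumptions made for illustration; they are not part of any library.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy regression data: y = 2x plus Gaussian noise (made up for illustration).
xs = [i / 10 for i in range(50)]
ys = [2 * x + random.gauss(0, 1) for x in xs]
data = list(zip(xs, ys))

def fit_1nn(sample):
    """Return a 1-nearest-neighbour predictor fitted on `sample` (hypothetical helper)."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def bagged_models(data, n_models):
    """Train one model per bootstrap sample, each drawn with replacement."""
    models = []
    for _ in range(n_models):
        sample = random.choices(data, k=len(data))
        models.append(fit_1nn(sample))
    return models

models = bagged_models(data, n_models=25)
x_test = 2.5
individual = [predict(x_test) for predict in models]
bagged_prediction = mean(individual)  # aggregate by averaging

print(min(individual), bagged_prediction, max(individual))
```

The averaged value always lies between the most extreme individual predictions, which is why it moves less than any single model when the training sample changes.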
Given three models trained on different samples, their predictions on a test point are: Model1: 0.7, Model2: 0.4, Model3: 0.9. What is the final bagging prediction by averaging?
Calculate the average of the three predictions.
The average is (0.7 + 0.4 + 0.9) / 3 = 2.0 / 3 = 0.666..., rounded to 0.67.
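The same arithmetic can be checked in a few lines of Python; the variable names are chosen for illustration only.

```python
from statistics import mean

predictions = [0.7, 0.4, 0.9]  # Model1, Model2, Model3
bagged = mean(predictions)     # (0.7 + 0.4 + 0.9) / 3

print(round(bagged, 2))  # 0.67
```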
Bagging is most effective in reducing variance. Which of these model types typically benefits the most from bagging?
Think about which models tend to have high variance and overfit easily.
Complex decision trees tend to overfit and have high variance. Bagging helps by averaging many trees trained on different samples, reducing variance and improving stability.
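One way to observe this is to compare a single unpruned tree with a bagged ensemble of the same trees under cross-validation. The sketch below assumes scikit-learn is installed and uses a synthetic dataset; the dataset, sizes, and seeds are arbitrary choices, and the scores will differ on other data.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (illustrative only).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# A fully grown tree: low bias, high variance.
single_tree = DecisionTreeClassifier(random_state=0)

# The first positional argument is the base estimator to be bagged.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    random_state=0,
)

tree_scores = cross_val_score(single_tree, X, y, cv=5)
bagged_scores = cross_val_score(bagged_trees, X, y, cv=5)

print(f"single tree : mean={tree_scores.mean():.3f} std={tree_scores.std():.3f}")
print(f"bagged trees: mean={bagged_scores.mean():.3f} std={bagged_scores.std():.3f}")
```

Comparing the mean and the spread of the fold scores gives a rough picture of how much stability the ensemble adds on a given dataset.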
What is the effect of increasing the number of base models (estimators) in a bagging ensemble?
Think about how averaging more models affects variance and prediction stability.
Adding more base models reduces variance by averaging more predictions, which improves stability. The gains show diminishing returns: each additional model contributes less than the one before, while training and prediction cost keep growing.
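A small simulation illustrates the trend. It assumes each model's prediction is an independent noisy estimate of the same true value, which is a simplification: models trained on bootstrap samples of one dataset are correlated, so in practice the variance levels off above zero. The function names and constants are hypothetical.

```python
import random
from statistics import pvariance

random.seed(1)  # fixed seed so the sketch is repeatable

def ensemble_prediction(n_models):
    """Average n_models independent noisy predictions around a true value of 1.0."""
    return sum(random.gauss(1.0, 0.5) for _ in range(n_models)) / n_models

def variance_of_ensemble(n_models, trials=2000):
    """Estimate how much the averaged prediction varies across repeated ensembles."""
    return pvariance([ensemble_prediction(n_models) for _ in range(trials)])

variances = {n: variance_of_ensemble(n) for n in (1, 5, 25, 100)}
for n, var in variances.items():
    print(n, round(var, 4))
```

Under the independence assumption the variance shrinks roughly in proportion to 1/n, so going from 1 to 5 models removes far more variance than going from 25 to 100.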
Consider this Python code snippet for bagging:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    X_train, y_train = ...  # training data
    models = []
    for _ in range(5):
        X_sample, y_sample = resample(X_train, y_train)
        model = DecisionTreeClassifier()
        model.fit(X_sample, y_sample)
        models.append(model)

    # Predict on test data
    predictions = []
    for model in models:
        predictions.append(model.predict(X_test))
    final_prediction = sum(predictions) / len(models)

What error will this code raise or what is the problem?
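Check what model.predict returns for a classifier, how sum() behaves on a list of arrays, and whether averaging those values gives a valid prediction.
Each model.predict call returns a NumPy array of class labels. The built-in sum() starts from 0, and 0 + array is valid for numeric arrays, so with numeric labels no TypeError is raised: the code runs and returns the mean of the labels, which is generally not a valid class. With string labels the addition fails with a TypeError. For classification, combine the predictions by majority vote, or average the predict_proba outputs and take the most probable class.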
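A minimal sketch of aggregating classifier outputs by majority vote, assuming NumPy is available and the class labels are non-negative integers. The prediction arrays are made up for illustration and stand in for the outputs of three fitted models.

```python
import numpy as np

# Hypothetical outputs of three classifiers on four test points.
predictions = [
    np.array([0, 1, 1, 2]),
    np.array([0, 1, 2, 2]),
    np.array([1, 1, 2, 0]),
]

# Built-in sum() runs on numeric arrays (0 + array broadcasts),
# but the mean of class labels is generally not a valid class.
label_mean = sum(predictions) / len(predictions)

# Majority vote: stack to shape (n_models, n_points), count labels per column.
stacked = np.stack(predictions)
n_classes = stacked.max() + 1
votes = np.apply_along_axis(np.bincount, 0, stacked, minlength=n_classes)
majority = votes.argmax(axis=0)  # ties go to the smallest label

print(label_mean)
print(majority)  # [0 1 2 2]
```

For string labels, one option is to encode them as integers first and map the winning index back to the original label.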