Stacking and blending in ML Python - Model Metrics & Evaluation

Stacking and blending combine multiple models to improve predictions. The right metric depends on the task: for classification, accuracy, precision, recall, and F1 score matter; for regression, mean squared error or R-squared. We focus on metrics that show whether the combined model predicts better than any individual model, which tells us whether stacking or blending truly improves results.
Suppose we stack two models to detect spam emails. After combining, the confusion matrix might look like this:
|                 | Predicted Spam           | Predicted Not Spam        |
|-----------------|--------------------------|---------------------------|
| Actual Spam     | True Positives (TP) = 85 | False Negatives (FN) = 10 |
| Actual Not Spam | False Positives (FP) = 15| True Negatives (TN) = 90  |
Total samples = 85 + 15 + 10 + 90 = 200
From this, we calculate:
- Precision = TP / (TP + FP) = 85 / (85 + 15) = 0.85
- Recall = TP / (TP + FN) = 85 / (85 + 10) = 0.8947
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.871
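The calculations above can be checked directly in Python with the counts from the confusion matrix:

```python
# Recomputing the metrics above from the confusion-matrix counts.
TP, FP, FN, TN = 85, 15, 10, 90

precision = TP / (TP + FP)                         # 85 / 100 = 0.85
recall = TP / (TP + FN)                            # 85 / 95  ≈ 0.8947
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(f"Precision: {precision:.4f}")  # 0.8500
print(f"Recall:    {recall:.4f}")     # 0.8947
print(f"F1 score:  {f1:.4f}")         # 0.8718
print(f"Accuracy:  {accuracy:.4f}")   # 0.8750
```

Note that accuracy (0.875) sits between precision and recall here; on imbalanced data it can diverge sharply from them, which is why the text emphasizes looking at all three.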
Stacking and blending aim to balance precision and recall better than single models. For example:
- If a spam filter has high precision but low recall, it misses many spam emails (bad for users).
- If it has high recall but low precision, many good emails are marked spam (annoying).
Stacking can combine models that are good at precision with those good at recall to get a better balance. This tradeoff depends on the problem's needs.
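A minimal sketch of this idea with scikit-learn's `StackingClassifier` follows; the dataset is synthetic and the choice of base models (a random forest plus a logistic regression) is illustrative, not prescribed by the text:

```python
# Sketch: stacking two base models with a logistic-regression meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Base models with different strengths; the meta-model learns how to
# weight their predictions to balance precision and recall.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-model is trained on out-of-fold predictions
)
stack.fit(X_train, y_train)
f1 = f1_score(y_test, stack.predict(X_test))
print(f"Stacked F1: {f1:.3f}")
```

The `cv=5` argument matters: the meta-model sees only out-of-fold base-model predictions, which is what keeps the combination honest.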
Good stacking/blending results show:
- Higher accuracy or F1 score than any single model alone.
- Balanced precision and recall suitable for the task.
- Stable performance on new data (not just training data).
Bad results show:
- No improvement or worse metrics compared to best single model.
- Overfitting signs: very high training accuracy but low test accuracy.
- Unbalanced precision or recall causing practical problems.
Common pitfalls to watch for:
- Data leakage: Using test data in training the stacking model inflates metrics falsely.
- Overfitting: The meta-model may memorize training data, showing high training but poor test metrics.
- Ignoring metric tradeoffs: Focusing only on accuracy can hide poor recall or precision.
- Confusion matrix mismatch: Not verifying that TP, FP, TN, and FN add up to the total sample count can cause wrong metric calculations.
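The leakage pitfall has a standard fix: generate meta-features with out-of-fold predictions (e.g. `cross_val_predict`) so no row is predicted by a model that was trained on it. A sketch under those assumptions, with an illustrative synthetic dataset:

```python
# Sketch: building meta-features without leakage via out-of-fold predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
base = RandomForestClassifier(n_estimators=50, random_state=0)

# Leaky way (do NOT do this): fit on all of X, then predict on the same X.
#   meta_features = base.fit(X, y).predict_proba(X)[:, 1]

# Safe way: each row's prediction comes from a fold that never saw that row.
meta_features = cross_val_predict(base, X, y, cv=5, method="predict_proba")[:, 1]

meta_model = LogisticRegression().fit(meta_features.reshape(-1, 1), y)
print(meta_features.shape)  # (500,)
```

The commented-out "leaky way" would make the meta-features nearly perfect on training data, producing exactly the inflated metrics the pitfall describes.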
Your stacking model has 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?
Answer: No, it is not good. High accuracy can be misleading if the fraud class is rare. The very low recall means the model misses most fraud cases, which is dangerous. For fraud detection, high recall is critical to catch as many frauds as possible.
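To see why the 98% figure is hollow, we can reconstruct one set of hypothetical counts consistent with 98% accuracy and 12% recall (the exact counts are assumptions, not from the question):

```python
# Hypothetical counts consistent with 98% accuracy and 12% recall
# on a rare-fraud dataset of 10,000 transactions (100 actual frauds).
TP, FN = 12, 88      # only 12 of 100 frauds caught -> recall = 0.12
TN, FP = 9788, 112   # 9,900 legitimate transactions
total = TP + FN + TN + FP

accuracy = (TP + TN) / total  # 0.98
recall = TP / (TP + FN)       # 0.12
print(f"Accuracy: {accuracy:.2%}, Recall: {recall:.2%}")

# A trivial model that predicts "not fraud" for everything would score
# 9900 / 10000 = 99% accuracy here, so accuracy alone is uninformative.
```

This is the class-imbalance effect in miniature: the stacked model barely beats a constant predictor on accuracy while missing 88% of fraud.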