Probability calibration in ML Python - Model Metrics & Evaluation

Which metric matters for Probability Calibration and WHY

Probability calibration checks whether a model's predicted probabilities match real outcomes. For example, if a model says there is a 70% chance of rain, it should rain about 7 times out of 10 when it says that. The key metrics are the calibration curve and the Brier score. The calibration curve shows how predicted probabilities compare to actual results; the Brier score measures the average squared difference between predicted probabilities and actual outcomes. Lower Brier scores mean better calibration.
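The Brier score is simple enough to compute by hand. Here is a minimal pure-Python sketch with made-up toy data; scikit-learn provides the same computation as `sklearn.metrics.brier_score_loss`:

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability
    and the actual outcome (0 or 1); lower is better, 0 is perfect."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)

# Toy data: 1 = event happened, 0 = it did not
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.8, 0.7, 0.2, 0.9, 0.3, 0.6, 0.75]

print(round(brier_score(y_true, y_prob), 4))  # 0.0628
```

Note that a model predicting exactly 0 or 1 and always being right would score 0; a model that is confidently wrong is penalized heavily by the squared difference.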

Confusion Matrix or Equivalent Visualization

Probability calibration is about probabilities, not just yes/no predictions, so a confusion matrix alone is not enough. Instead, we use a calibration curve, which groups predictions into probability bins and compares the mean predicted probability in each bin with the actual event frequency.

Probability Bin | Predicted Probability | Actual Frequency
---------------------------------------------------------
     0.0 - 0.1 | 0.08                  | 0.07
     0.1 - 0.2 | 0.15                  | 0.18
     0.2 - 0.3 | 0.25                  | 0.22
     0.3 - 0.4 | 0.35                  | 0.33
     0.4 - 0.5 | 0.45                  | 0.48
     0.5 - 0.6 | 0.55                  | 0.52
     0.6 - 0.7 | 0.65                  | 0.68
     0.7 - 0.8 | 0.75                  | 0.73
     0.8 - 0.9 | 0.85                  | 0.88
     0.9 - 1.0 | 0.95                  | 0.94

This table shows predicted probabilities and how often the event actually happened in that range. Good calibration means these numbers are close.
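A table like this can be produced with a few lines of Python. The sketch below uses a hypothetical helper (`calibration_bins`, not from any particular library) that bins predictions by predicted probability and reports, per non-empty bin, the mean prediction and the actual frequency; scikit-learn's `sklearn.calibration.calibration_curve` does essentially this:

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width probability bins; for each
    non-empty bin return (mean predicted probability, actual frequency)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # clamp the index so p == 1.0 lands in the last bin
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [
        (round(sum(p for p, _ in b) / len(b), 3),   # mean predicted prob
         round(sum(y for _, y in b) / len(b), 3))   # actual frequency
        for b in bins if b
    ]

# Toy example with two bins: low- and high-probability predictions
print(calibration_bins([0, 0, 1, 1], [0.1, 0.15, 0.9, 0.95], n_bins=2))
# [(0.125, 0.0), (0.925, 1.0)]
```

Good calibration shows up as the two numbers in each pair being close to each other, exactly as in the table above.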

Precision vs Recall Tradeoff and Calibration

Precision and recall judge classification decisions (yes/no), while calibration judges how well predicted probabilities match reality. A model can have high precision and recall but poor calibration if its probabilities are systematically too high or too low. For example, a spam filter might catch spam well (high recall), but if it always reports a 99% spam chance even when unsure, it is poorly calibrated. Calibration lets you trust the probability values themselves, which matters for decisions like medical diagnosis or weather forecasting.
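The spam-filter point can be made concrete with a small sketch (made-up numbers): two models make identical yes/no calls at a 0.5 threshold, so their precision and recall are the same, yet their Brier scores differ because one is overconfident.

```python
def brier(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)

# 10 emails flagged as spam; 8 really are spam (true rate 0.8)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

overconfident = [0.99] * 10  # always claims "99% spam"
calibrated    = [0.80] * 10  # matches the true 80% rate

# Same classifications at a 0.5 threshold, different probability quality:
print(round(brier(y_true, overconfident), 4))  # 0.1961
print(round(brier(y_true, calibrated), 4))     # 0.16
```

Any threshold-based metric (accuracy, precision, recall) cannot tell these two models apart; the Brier score can.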

What Good vs Bad Calibration Looks Like

Good calibration: Predicted probabilities closely match actual outcomes. For example, when the model predicts 0.7 chance, the event happens about 70% of the time. The calibration curve is close to the diagonal line. Brier score is low (closer to 0).

Bad calibration: Predicted probabilities are too high or too low compared to actual outcomes. For example, the model predicts 0.9 chance but the event happens only 50% of the time. The calibration curve deviates far from the diagonal. Brier score is higher.

Common Pitfalls in Probability Calibration
  • Ignoring calibration: Using raw probabilities without checking calibration can mislead decisions.
  • Small sample sizes: Calibration curves can be noisy if there are few samples in probability bins.
  • Overfitting calibration: Adjusting probabilities too much on training data can hurt performance on new data.
  • Confusing accuracy with calibration: A model can be accurate in classification but poorly calibrated in probabilities.
  • Data leakage: If calibration is done on data used for training, it gives overly optimistic results.
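The small-sample pitfall is easy to demonstrate. This sketch (stdlib only, synthetic coin flips) simulates a probability bin whose true event rate is 0.7: with only a handful of samples the observed frequency can drift well away from 0.7, while a large bin settles close to it.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def observed_freq(p, n):
    """Simulate n independent events with true probability p and
    return the fraction that occurred."""
    return sum(random.random() < p for _ in range(n)) / n

small = observed_freq(0.7, 5)        # noisy: few samples in the bin
large = observed_freq(0.7, 10_000)   # stable: close to the true 0.7

print(small, round(large, 3))
```

This is why calibration curves built from small validation sets should be read with caution, or drawn with fewer, wider bins.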

Self Check: Your model has 98% accuracy but poor calibration. Is it good?

Not necessarily. High accuracy means the model predicts the right class often, but if the predicted probabilities are not reliable, decisions based on those probabilities can be wrong. For example, if a medical test says 99% chance of disease but the real chance is only 50%, doctors might overreact. So, good calibration is important when you need trustworthy probability estimates, even if accuracy is high.

Key Result
Probability calibration is best measured by calibration curves and Brier score to ensure predicted probabilities match real outcomes.