Probability calibration adjusts the probabilities a model predicts so they match real-world outcome frequencies. It makes predictions more trustworthy and easier to interpret.
Probability Calibration in Machine Learning (Python)
from sklearn.calibration import CalibratedClassifierCV

# 'estimator' in scikit-learn >= 1.2; older versions used 'base_estimator'
calibrated_model = CalibratedClassifierCV(estimator=model, method='sigmoid', cv='prefit')
calibrated_model.fit(X_calibration, y_calibration)

# Predict calibrated probabilities
probs = calibrated_model.predict_proba(X_test)
estimator is the original model you want to calibrate (this parameter was called base_estimator before scikit-learn 1.2).
method can be 'sigmoid' (Platt scaling) or 'isotonic' (a non-parametric method).
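To make the two methods concrete, here is a minimal sketch of what each one learns under the hood, using tiny illustrative scores and labels (the values are made up for demonstration): Platt scaling fits a logistic regression on the model's 1-D scores, while isotonic regression fits a monotone step function from scores to outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Toy uncalibrated scores and true labels (illustrative values only)
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])
labels = np.array([0, 0, 1, 0, 1, 1])

# Platt scaling: a logistic regression on the one-dimensional scores
platt = LogisticRegression()
platt.fit(scores.reshape(-1, 1), labels)
platt_probs = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: a non-parametric, monotone mapping from scores to outcomes
iso = IsotonicRegression(out_of_bounds='clip')
iso_probs = iso.fit_transform(scores, labels)

print('Platt-scaled:', platt_probs)
print('Isotonic:    ', iso_probs)
```

Because Platt scaling has only two parameters, it works with small calibration sets but assumes a sigmoid-shaped distortion; isotonic regression is more flexible but needs more data.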
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# Train the base model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)

# Calibrate the already-fitted model on a separate calibration set
calibrated = CalibratedClassifierCV(estimator=model, method='sigmoid', cv='prefit')
calibrated.fit(X_calibration, y_calibration)
probs = calibrated.predict_proba(X_test)
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

# With an integer cv, pass an UNFITTED estimator: each fold trains a clone
# of the forest and learns the isotonic mapping on the held-out part,
# so no separate calibration set is needed.
rf = RandomForestClassifier()
calibrated_rf = CalibratedClassifierCV(estimator=rf, method='isotonic', cv=5)
calibrated_rf.fit(X_train, y_train)
probs = calibrated_rf.predict_proba(X_test)
This program trains a random forest, calibrates its predicted probabilities on a separate calibration set, and compares the predicted probabilities before and after calibration. It also prints calibration curve points, which show how closely the predicted probabilities track true outcome frequencies.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Create a simple binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data into train, calibration, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_calib, X_test, y_calib, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a random forest classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Calibrate the classifier with the sigmoid method and the calibration set
calibrated_rf = CalibratedClassifierCV(estimator=rf, method='sigmoid', cv='prefit')
calibrated_rf.fit(X_calib, y_calib)

# Predict probabilities before and after calibration
probs_uncalibrated = rf.predict_proba(X_test)[:, 1]
probs_calibrated = calibrated_rf.predict_proba(X_test)[:, 1]

# Calculate calibration curves
prob_true_uncal, prob_pred_uncal = calibration_curve(y_test, probs_uncalibrated, n_bins=10)
prob_true_cal, prob_pred_cal = calibration_curve(y_test, probs_calibrated, n_bins=10)

# Print first 5 predicted probabilities before and after calibration
print('First 5 predicted probabilities before calibration:', probs_uncalibrated[:5])
print('First 5 predicted probabilities after calibration:', probs_calibrated[:5])

# Print calibration curve points
print('Calibration curve points before calibration:', list(zip(prob_pred_uncal, prob_true_uncal)))
print('Calibration curve points after calibration:', list(zip(prob_pred_cal, prob_true_cal)))
Calibration works best when you have a separate calibration dataset or use cross-validation.
Isotonic calibration can overfit if the calibration set is small.
Calibrated probabilities help in making better decisions when probabilities matter, not just class labels.
Probability calibration adjusts model outputs to better match true chances.
Use calibration when you need reliable probability estimates for decisions.
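As one example of such a decision, calibrated probabilities let you pick a threshold that minimizes expected cost rather than defaulting to 0.5. A minimal sketch with hypothetical costs (the 5:1 cost ratio is an assumption for illustration):

```python
# Expected-cost decision rule using a calibrated probability.
# Hypothetical costs: a false negative is 5x as costly as a false positive.
COST_FP = 1.0
COST_FN = 5.0

def decide(p_positive):
    """Predict positive when the expected cost of acting is lower."""
    expected_cost_act = (1 - p_positive) * COST_FP   # acted, but it was a negative
    expected_cost_skip = p_positive * COST_FN        # skipped, but it was a positive
    return expected_cost_act < expected_cost_skip

# The cost-optimal threshold is COST_FP / (COST_FP + COST_FN) ~= 0.167,
# far from the default 0.5. This rule is only sound if p_positive is calibrated.
print(decide(0.2))  # True: a 20% risk already justifies acting at these costs
print(decide(0.1))  # False
```

With uncalibrated probabilities the same arithmetic would be applied to numbers that do not reflect real risk, so the chosen threshold would be wrong.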
Common methods include sigmoid (Platt scaling) and isotonic regression.