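Probability calibration adjusts a model's predicted probabilities so that they match how often events actually happen. For example, among the cases where a well-calibrated model predicts 70%, about 70% should turn out positive. This makes predictions more trustworthy and easier to interpret.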
Probability calibration in ML Python
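from sklearn.calibration import CalibratedClassifierCV

# model is an already fitted classifier; cv='prefit' tells the calibrator not to refit it
calibrated_model = CalibratedClassifierCV(estimator=model, method='sigmoid', cv='prefit')
calibrated_model.fit(X_calibration, y_calibration)

# To predict calibrated probabilities:
probs = calibrated_model.predict_proba(X_test)

estimator is the original model you want to calibrate. It was named base_estimator before scikit-learn 1.2, and the old name was removed in 1.4. Releases from 1.6 onward also deprecate cv='prefit' in favor of wrapping the fitted model in FrozenEstimator.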
method can be 'sigmoid' (Platt scaling) or 'isotonic' (a non-parametric method).
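To get a feel for the two methods, here is a minimal sketch that applies each one by hand to a set of raw scores. The scores and labels are made up for illustration, and the logistic regression only approximates Platt scaling (scikit-learn's own sigmoid calibrator uses a slightly different fitting procedure), so treat this as a picture of the idea rather than the library's internals.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical raw model scores in [0, 1] and matching labels.
# The true chance of a positive is scores**2, so the raw scores are too high.
scores = rng.uniform(0, 1, size=500)
labels = (rng.uniform(0, 1, size=500) < scores**2).astype(int)

# Sigmoid (Platt-style): fit a logistic curve on the single score feature.
# A large C keeps regularization weak so the curve follows the data.
platt = LogisticRegression(C=1e6)
platt.fit(scores.reshape(-1, 1), labels)
sigmoid_probs = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic: fit a non-decreasing step function, with no fixed curve shape.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
isotonic_probs = iso.fit_transform(scores, labels)

print('Raw score 0.5 mapped by sigmoid: ', platt.predict_proba([[0.5]])[0, 1])
print('Raw score 0.5 mapped by isotonic:', iso.predict([0.5])[0])
```

The sigmoid method has only two parameters to learn, while isotonic regression can bend to any non-decreasing shape, which is why it needs more data.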
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# Fit the base model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)

# Calibrate the fitted model on a separate calibration set
calibrated = CalibratedClassifierCV(estimator=model, method='sigmoid', cv='prefit')
calibrated.fit(X_calibration, y_calibration)

probs = calibrated.predict_proba(X_test)
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

# No need to fit rf first: with cv=5, CalibratedClassifierCV fits clones of it on each fold
rf = RandomForestClassifier()

calibrated_rf = CalibratedClassifierCV(estimator=rf, method='isotonic', cv=5)
calibrated_rf.fit(X_train, y_train)

probs = calibrated_rf.predict_proba(X_test)
This program trains a random forest, calibrates its predicted probabilities using a separate calibration set, and compares predicted probabilities before and after calibration. It also prints calibration curve points to see how close predicted probabilities are to true outcomes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Create a simple binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data into train (60%), calibration (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_calib, X_test, y_calib, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a random forest classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Calibrate the fitted classifier using the sigmoid method and the calibration set
calibrated_rf = CalibratedClassifierCV(estimator=rf, method='sigmoid', cv='prefit')
calibrated_rf.fit(X_calib, y_calib)

# Predict positive-class probabilities before and after calibration
probs_uncalibrated = rf.predict_proba(X_test)[:, 1]
probs_calibrated = calibrated_rf.predict_proba(X_test)[:, 1]

# Calculate calibration curves
prob_true_uncal, prob_pred_uncal = calibration_curve(y_test, probs_uncalibrated, n_bins=10)
prob_true_cal, prob_pred_cal = calibration_curve(y_test, probs_calibrated, n_bins=10)

# Print first 5 predicted probabilities before and after calibration
print('First 5 predicted probabilities before calibration:', probs_uncalibrated[:5])
print('First 5 predicted probabilities after calibration:', probs_calibrated[:5])

# Print calibration curve points as (mean predicted probability, observed fraction of positives)
print('Calibration curve points before calibration:', list(zip(prob_pred_uncal, prob_true_uncal)))
print('Calibration curve points after calibration:', list(zip(prob_pred_cal, prob_true_cal)))
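Calibration curve points can be hard to compare by eye, so it often helps to boil them down to a single number. The sketch below shows two common summaries on a tiny hand-made example: the Brier score from scikit-learn, and a simple expected calibration error. The helper expected_calibration_error is a hypothetical function written for this illustration, not part of scikit-learn. Lower is better for both.

```python
import numpy as np
from sklearn.metrics import brier_score_loss


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Hypothetical helper: weighted average gap between the mean predicted
    probability and the observed fraction of positives in each bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use only the interior edges so that a probability of 1.0 lands in the last bin
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return ece


# Tiny made-up example: four predictions and their true labels
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.6, 0.9])

brier = brier_score_loss(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob, n_bins=2)

print('Brier score:', brier)
print('Expected calibration error:', ece)
```

The same two calls can be applied to probs_uncalibrated and probs_calibrated from the program to check whether calibration actually helped on the test set.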
Calibration works best when you have a separate calibration dataset or use cross-validation.
Isotonic calibration can overfit if the calibration set is small.
Calibrated probabilities help in making better decisions when probabilities matter, not just class labels.
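Here is a small sketch of what "decisions where probabilities matter" can look like. It assumes made-up costs for the two kinds of mistakes and picks the action with the lower expected cost. The cost values and probabilities are hypothetical, and the rule only gives sensible results if the probabilities are well calibrated.

```python
import numpy as np

# Hypothetical costs: missing a positive case is 5 times worse than a false alarm
cost_false_positive = 1.0
cost_false_negative = 5.0

# Acting costs (1 - p) * cost_fp on average; not acting costs p * cost_fn.
# Acting is cheaper when p > cost_fp / (cost_fp + cost_fn).
threshold = cost_false_positive / (cost_false_positive + cost_false_negative)

# Made-up calibrated probabilities for four cases
probs = np.array([0.05, 0.20, 0.50, 0.90])
decisions = probs > threshold

print('Decision threshold:', threshold)
print('Act on each case:', decisions)
```

With these costs the threshold is 1/6 rather than the usual 0.5, so a case with a 20% chance is already worth acting on.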
Probability calibration adjusts model outputs to better match true chances.
Use calibration when you need reliable probability estimates for decisions.
Common methods include sigmoid (Platt scaling) and isotonic regression.
Practice
probability calibration in machine learning?
Solution
Step 1: Understand the purpose of probability calibration
Probability calibration aims to make predicted probabilities match the actual chance of an event happening.
Step 2: Differentiate from accuracy and training speed
Accuracy relates to correct labels, not probability quality. Calibration focuses on probability quality, not dataset size or speed.
Final Answer:
To adjust predicted probabilities to better reflect true likelihoods -> Option A
Quick Check:
Calibration = Adjust probabilities [OK]
- Confusing calibration with accuracy improvement
- Thinking calibration changes dataset size
- Assuming calibration speeds training
Solution
Step 1: Identify calibration methods
Platt scaling is a sigmoid-based method commonly used to calibrate probabilities.
Step 2: Exclude unrelated methods
Gradient boosting is a model training technique, K-means is a clustering algorithm, and PCA is a dimensionality reduction technique. None of them is a calibration method.
Final Answer:
Platt scaling -> Option C
Quick Check:
Calibration method = Platt scaling [OK]
- Confusing boosting with calibration
- Mixing clustering or PCA with calibration
- Choosing any popular ML method as calibration
calibrated_clf.predict_proba([[0.5, 1.5]])?
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# n_informative and n_redundant are set explicitly: with only 2 features, the
# defaults (2 informative + 2 redundant) make make_classification raise a ValueError
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)
clf = LogisticRegression().fit(X, y)

calibrated_clf = CalibratedClassifierCV(clf, method='sigmoid', cv='prefit')
calibrated_clf.fit(X, y)

probs = calibrated_clf.predict_proba([[0.5, 1.5]])
print(probs)
Solution
Step 1: Understand CalibratedClassifierCV output
Using method='sigmoid' with cv='prefit' fits the calibration on top of the existing model and outputs probabilities for each class as a 2D array.
Step 2: Check predict_proba output format
predict_proba returns one row per sample and one column per class, not a single float or class labels.
Final Answer:
A 2D array with calibrated probabilities for each class, e.g. [[0.3, 0.7]] -> Option A
Quick Check:
predict_proba output = 2D array [OK]
- Expecting a single float instead of array
- Confusing predict_proba with predict
- Misunderstanding cv='prefit' usage
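The output shape is easy to check directly. This is a minimal sketch on a made-up dataset; it uses cv=5 instead of cv='prefit', which is an assumption made here so the example runs the same way across scikit-learn versions. It also calls predict to show the difference between probabilities and labels.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset with two features (all values are illustrative)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Calibrate with 5-fold cross-validation; the base model is fit inside each fold
calibrated_clf = CalibratedClassifierCV(LogisticRegression(), method='sigmoid', cv=5)
calibrated_clf.fit(X, y)

probs = calibrated_clf.predict_proba([[0.5, 1.5]])   # probabilities per class
labels = calibrated_clf.predict([[0.5, 1.5]])        # predicted class label

print(probs.shape)   # (1, 2): one sample, two classes
print(probs.sum())   # the two class probabilities add up to 1 (up to rounding)
print(labels.shape)  # (1,): one label per sample
```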
CalibratedClassifierCV with cv=5, but got an error: "ValueError: Expected cv split to be a cross-validation generator or an iterable, got int instead." What is the likely cause?
Solution
Step 1: Analyze the error message
The error "Expected cv split to be a cross-validation generator or an iterable, got int instead." points directly at the cv parameter: it received an integer (5) where a splitter was expected.
Step 2: Check CalibratedClassifierCV cv usage
This occurs when cv is passed as an int but the context requires an explicit cross-validation object like StratifiedKFold(5). Note that CalibratedClassifierCV itself documents an integer cv as valid (the random forest example earlier uses cv=5), so read this as a quiz scenario about matching the argument type to what the code expects.
Step 3: Rule out unrelated causes
Fitting the base model first only matters for cv='prefit', and neither dataset size nor the calibration method triggers this error.
Final Answer:
You passed an integer instead of a cross-validation splitter object -> Option B
Quick Check:
Error 'got int instead' = cv type mismatch [OK]
- Passing an integer to cv instead of a splitter object
- Confusing cv parameter usage
- Assuming calibration method causes error
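For reference, here is a minimal sketch showing the two usual ways to pass cv to CalibratedClassifierCV: as an integer number of folds, and as an explicit splitter object. The dataset is made up for illustration. Both forms should fit without error and produce one calibrated model per fold.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

n_fitted = []
for cv in (5, StratifiedKFold(n_splits=5)):
    model = CalibratedClassifierCV(LogisticRegression(), method='sigmoid', cv=cv)
    model.fit(X, y)
    # With the default ensemble behavior, one calibrated classifier is kept per fold
    n_fitted.append(len(model.calibrated_classifiers_))

print(n_fitted)
```

An explicit splitter is mostly useful when you want control over shuffling or the random seed.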
Solution
Step 1: Consider calibration methods for small datasets
Platt scaling (sigmoid) is preferred for small datasets because it fits only two parameters and is less prone to overfitting than isotonic regression.
Step 2: Use cross-validation to avoid losing accuracy
Applying Platt scaling with cross-validation lets every sample be used for both fitting and calibration, so no separate calibration set has to be carved out of the limited data.
Step 3: Evaluate other options
Isotonic regression may overfit small data, retraining may not fix calibration, and discarding probabilities loses useful information.
Final Answer:
Apply Platt scaling calibration using cross-validation -> Option D
Quick Check:
Small data calibration = Platt scaling + CV [OK]
- Using isotonic regression on small data causing overfit
- Retraining model instead of calibrating
- Ignoring probability calibration importance
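The recommended approach can be sketched as follows. The example assumes a small made-up dataset and a Gaussian naive Bayes base model, both chosen only for illustration. It calibrates with each method using 3-fold cross-validation and reports the Brier score on held-out data. Which method scores lower depends on the data, so no particular result is claimed here.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Deliberately small synthetic dataset (illustrative only)
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_small, X_test, y_small, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

scores = {}
for method in ('sigmoid', 'isotonic'):
    # Cross-validation reuses the 60 training samples for fitting and calibration
    model = CalibratedClassifierCV(GaussianNB(), method=method, cv=3)
    model.fit(X_small, y_small)
    test_probs = model.predict_proba(X_test)[:, 1]
    scores[method] = brier_score_loss(y_test, test_probs)  # lower is better

print(scores)
```

Running this with different random_state values gives a sense of how much the comparison varies on small data.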
