When checking for bias in a model, we look beyond overall accuracy and ask whether the model treats all groups fairly. Key metrics include Demographic Parity (does each group receive positive outcomes at the same rate?), Equal Opportunity (does each group have the same true positive rate?), and Disparate Impact (the ratio of positive-outcome rates between groups). These metrics help us detect whether the model unfairly favors or harms certain groups.
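As a minimal sketch, these three metrics can be computed directly from predictions, true labels, and a group attribute. The prediction and label arrays below are hypothetical, chosen only to illustrate the calculations:

```python
def positive_rate(preds):
    # Share of positive predictions (used for Demographic Parity)
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    # Recall: correctly predicted positives among actual positives
    positives = [p for p, y in zip(preds, labels) if y == 1]
    return sum(positives) / len(positives)

def disparate_impact(preds_a, preds_b):
    # Ratio of positive-outcome rates between two groups
    return positive_rate(preds_b) / positive_rate(preds_a)

# Hypothetical predictions (1 = positive outcome) and true labels per group
preds_a  = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
labels_a = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
preds_b  = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
labels_b = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

print(positive_rate(preds_a), positive_rate(preds_b))   # Demographic Parity check
print(true_positive_rate(preds_a, labels_a),
      true_positive_rate(preds_b, labels_b))            # Equal Opportunity check
print(disparate_impact(preds_a, preds_b))               # Disparate Impact ratio
```

With these toy arrays, Group A gets positives at rate 0.6 versus 0.3 for Group B, so the disparate impact ratio is 0.5 — well below the 0.8 floor discussed later.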
Imagine a model predicting loan approvals for two groups: Group A and Group B.
Group A Confusion Matrix:
TP=80 FP=20
FN=10 TN=90
Group B Confusion Matrix:
TP=40 FP=10
FN=30 TN=120
Total samples Group A = 80+20+10+90 = 200
Total samples Group B = 40+10+30+120 = 200
From these, we calculate metrics like True Positive Rate (Recall) for each group:
- Group A Recall = 80 / (80 + 10) ≈ 0.89
- Group B Recall = 40 / (40 + 30) ≈ 0.57
This gap of roughly 0.32 means a qualified applicant in Group B is much less likely to be correctly approved, indicating bias (an Equal Opportunity violation).
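The recall calculation above takes only a few lines of Python, using the counts from the two confusion matrices:

```python
def recall(tp, fn):
    # True Positive Rate: TP / (TP + FN)
    return tp / (tp + fn)

recall_a = recall(tp=80, fn=10)   # Group A
recall_b = recall(tp=40, fn=30)   # Group B

print(round(recall_a, 2))             # 0.89
print(round(recall_b, 2))             # 0.57
print(round(recall_a - recall_b, 2))  # Equal Opportunity gap
```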
Suppose a hiring model favors Group A over Group B. If we optimize only for overall precision, the model might reject many qualified candidates from Group B (low recall for Group B). To fix this, we might accept a slight drop in precision in exchange for higher recall on Group B, making the model fairer.
Example:
- High precision but low recall for Group B means many qualified candidates are missed.
- Improving recall for Group B may lower precision but reduces unfair rejection.
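One common mitigation is adjusting the decision threshold for the disadvantaged group. Here is a minimal sketch; the candidate scores, labels, and threshold values are illustrative assumptions, not data from the text:

```python
def apply_threshold(scores, threshold):
    # Convert model scores into accept (1) / reject (0) decisions
    return [1 if s >= threshold else 0 for s in scores]

def recall_of(preds, labels):
    # Fraction of truly qualified candidates who were accepted
    pos = [p for p, y in zip(preds, labels) if y == 1]
    return sum(pos) / len(pos)

# Hypothetical model scores and true qualifications for Group B candidates
scores_b = [0.72, 0.48, 0.55, 0.30, 0.61, 0.44]
qualified_b = [1, 1, 1, 0, 1, 0]

strict = apply_threshold(scores_b, 0.60)   # precision-oriented cutoff
lenient = apply_threshold(scores_b, 0.45)  # recall-oriented cutoff

print(recall_of(strict, qualified_b))   # misses half the qualified candidates
print(recall_of(lenient, qualified_b))  # recovers all of them in this toy data
```

Lowering the cutoff admits some borderline scores, which may cost precision, but in this toy example it lifts Group B recall from 0.5 to 1.0.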
Good bias metrics mean similar performance across groups:
- Recall difference between groups < 0.05 (5%)
- Disparate Impact ratio close to 1 (between 0.8 and 1.25 is often acceptable)
- Equal Opportunity difference small
Bad values show large gaps, like:
- Recall difference > 0.2 (20%)
- Disparate Impact < 0.8 or > 1.25
- One group has very low true positive rate compared to others
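The guideline values above can be bundled into a simple check. The cutoffs (recall gap below 0.05, disparate impact within 0.8–1.25) come from the text; the function name is just for illustration:

```python
def fairness_report(recall_a, recall_b, di_ratio):
    # Flags each guideline: recall gap < 0.05, disparate impact in [0.8, 1.25]
    issues = []
    if abs(recall_a - recall_b) >= 0.05:
        issues.append(f"recall gap {abs(recall_a - recall_b):.2f} >= 0.05")
    if not 0.8 <= di_ratio <= 1.25:
        issues.append(f"disparate impact {di_ratio:.2f} outside [0.8, 1.25]")
    return issues or ["no fairness flags raised"]

# Using the loan-approval confusion matrices from the example above
pos_rate_a = (80 + 20) / 200   # predicted-positive rate, Group A
pos_rate_b = (40 + 10) / 200   # predicted-positive rate, Group B
print(fairness_report(0.89, 0.57, pos_rate_b / pos_rate_a))
```

For the loan example this flags both guidelines: the recall gap is 0.32 and the disparate impact ratio is 0.25 / 0.50 = 0.5.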
Common pitfalls when measuring bias:
- Ignoring base rates: Different groups may have different real-world rates, so metrics must consider context.
- Accuracy paradox: High overall accuracy can hide poor performance on minority groups.
- Data leakage: Sensitive attributes leaking into features can cause hidden bias.
- Overfitting mitigation: Over-correcting bias can reduce overall model usefulness.
- Single metric focus: Using only one fairness metric can miss other bias types.
Question: Your model has 98% accuracy overall but only 12% recall on fraud cases in a minority group. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases in that group, which is harmful. High overall accuracy hides this problem because fraud is rare. You need to improve recall for that group to reduce bias and protect users.
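The accuracy paradox in this scenario can be reconstructed with a simplified example. The confusion-matrix counts below are assumptions chosen to match 98% accuracy and 12% fraud recall; they are not data from the text:

```python
# Illustrative counts: fraud is rare, so missing most of it barely dents accuracy
tp, fn = 12, 88        # fraud cases: only 12 of 100 caught
tn, fp = 9788, 112     # legitimate transactions

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
recall_fraud = tp / (tp + fn)

print(f"accuracy: {accuracy:.2%}")          # looks excellent
print(f"fraud recall: {recall_fraud:.0%}")  # reveals the real problem
```

Because fraud makes up only 1% of the data, a model that misses 88 of 100 fraud cases still scores 98% accuracy, which is exactly why per-group recall must be checked separately.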