When checking for bias in a model, we look beyond overall accuracy and ask whether the model treats all groups fairly. Key metrics include Demographic Parity (does each group receive positive outcomes at the same rate?), Equal Opportunity (does each group have the same true positive rate?), and Disparate Impact (the ratio of positive-outcome rates between groups). These metrics help us detect whether the model unfairly favors or harms certain groups.
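As a minimal sketch, these three metrics can be computed directly from predictions, true labels, and a group attribute. The prediction and label arrays below are hypothetical, chosen only to illustrate the calculations:

```python
def positive_rate(preds):
    # Share of positive predictions (used for Demographic Parity)
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    # Recall: correctly predicted positives among actual positives
    positives = [p for p, y in zip(preds, labels) if y == 1]
    return sum(positives) / len(positives)

def disparate_impact(preds_a, preds_b):
    # Ratio of positive-outcome rates between two groups
    return positive_rate(preds_b) / positive_rate(preds_a)

# Hypothetical predictions (1 = positive outcome) and true labels per group
preds_a  = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
labels_a = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
preds_b  = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
labels_b = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

print(positive_rate(preds_a), positive_rate(preds_b))   # Demographic Parity check
print(true_positive_rate(preds_a, labels_a),
      true_positive_rate(preds_b, labels_b))            # Equal Opportunity check
print(disparate_impact(preds_a, preds_b))               # Disparate Impact ratio
```

With these toy arrays, Group A gets positives at rate 0.6 versus 0.3 for Group B, so the disparate impact ratio is 0.5 — well below the 0.8 floor discussed later.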
Imagine a model predicting loan approvals for two groups: Group A and Group B.
Group A Confusion Matrix:
TP=80 FP=20
FN=10 TN=90
Group B Confusion Matrix:
TP=40 FP=10
FN=30 TN=120
Total samples Group A = 80+20+10+90 = 200
Total samples Group B = 40+10+30+120 = 200
From these, we calculate metrics like True Positive Rate (Recall) for each group:
- Group A Recall = 80 / (80 + 10) ≈ 0.89
- Group B Recall = 40 / (40 + 30) ≈ 0.57
This gap of roughly 0.32 means a qualified applicant in Group B is much less likely to be correctly approved, indicating bias (an Equal Opportunity violation).
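The recall calculation above takes only a few lines of Python, using the counts from the two confusion matrices:

```python
def recall(tp, fn):
    # True Positive Rate: TP / (TP + FN)
    return tp / (tp + fn)

recall_a = recall(tp=80, fn=10)   # Group A
recall_b = recall(tp=40, fn=30)   # Group B

print(round(recall_a, 2))             # 0.89
print(round(recall_b, 2))             # 0.57
print(round(recall_a - recall_b, 2))  # Equal Opportunity gap
```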
Suppose a hiring model favors Group A over Group B. If we optimize only for overall precision, the model might reject many qualified candidates from Group B (low recall for Group B). To fix this, we might accept a slight drop in precision in exchange for higher recall on Group B, making the model fairer.
Example:
- High precision but low recall for Group B means many qualified candidates are missed.
- Improving recall for Group B may lower precision but reduces unfair rejection.
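One common mitigation is adjusting the decision threshold for the disadvantaged group. Here is a minimal sketch; the candidate scores, labels, and threshold values are illustrative assumptions, not data from the text:

```python
def apply_threshold(scores, threshold):
    # Convert model scores into accept (1) / reject (0) decisions
    return [1 if s >= threshold else 0 for s in scores]

def recall_of(preds, labels):
    # Fraction of truly qualified candidates who were accepted
    pos = [p for p, y in zip(preds, labels) if y == 1]
    return sum(pos) / len(pos)

# Hypothetical model scores and true qualifications for Group B candidates
scores_b = [0.72, 0.48, 0.55, 0.30, 0.61, 0.44]
qualified_b = [1, 1, 1, 0, 1, 0]

strict = apply_threshold(scores_b, 0.60)   # precision-oriented cutoff
lenient = apply_threshold(scores_b, 0.45)  # recall-oriented cutoff

print(recall_of(strict, qualified_b))   # misses half the qualified candidates
print(recall_of(lenient, qualified_b))  # recovers all of them in this toy data
```

Lowering the cutoff admits some borderline scores, which may cost precision, but in this toy example it lifts Group B recall from 0.5 to 1.0.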
Good bias metrics mean similar performance across groups:
- Recall difference between groups < 0.05 (5%)
- Disparate Impact ratio close to 1 (between 0.8 and 1.25 is often acceptable)
- Equal Opportunity difference small
Bad values show large gaps, like:
- Recall difference > 0.2 (20%)
- Disparate Impact < 0.8 or > 1.25
- One group has very low true positive rate compared to others
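The guideline values above can be bundled into a simple check. The cutoffs (recall gap below 0.05, disparate impact within 0.8–1.25) come from the text; the function name is just for illustration:

```python
def fairness_report(recall_a, recall_b, di_ratio):
    # Flags each guideline: recall gap < 0.05, disparate impact in [0.8, 1.25]
    issues = []
    if abs(recall_a - recall_b) >= 0.05:
        issues.append(f"recall gap {abs(recall_a - recall_b):.2f} >= 0.05")
    if not 0.8 <= di_ratio <= 1.25:
        issues.append(f"disparate impact {di_ratio:.2f} outside [0.8, 1.25]")
    return issues or ["no fairness flags raised"]

# Using the loan-approval confusion matrices from the example above
pos_rate_a = (80 + 20) / 200   # predicted-positive rate, Group A
pos_rate_b = (40 + 10) / 200   # predicted-positive rate, Group B
print(fairness_report(0.89, 0.57, pos_rate_b / pos_rate_a))
```

For the loan example this flags both guidelines: the recall gap is 0.32 and the disparate impact ratio is 0.25 / 0.50 = 0.5.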
Common pitfalls when measuring bias:
- Ignoring base rates: Different groups may have different real-world rates, so metrics must consider context.
- Accuracy paradox: High overall accuracy can hide poor performance on minority groups.
- Data leakage: Sensitive attributes leaking into features can cause hidden bias.
- Overfitting mitigation: Over-correcting bias can reduce overall model usefulness.
- Single metric focus: Using only one fairness metric can miss other bias types.
Question: Your model has 98% accuracy overall but only 12% recall on fraud cases in a minority group. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases in that group, which is harmful. High overall accuracy hides this problem because fraud is rare. You need to improve recall for that group to reduce bias and protect users.
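The accuracy paradox in this scenario can be reconstructed with a simplified example. The confusion-matrix counts below are assumptions chosen to match 98% accuracy and 12% fraud recall; they are not data from the text:

```python
# Illustrative counts: fraud is rare, so missing most of it barely dents accuracy
tp, fn = 12, 88        # fraud cases: only 12 of 100 caught
tn, fp = 9788, 112     # legitimate transactions

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
recall_fraud = tp / (tp + fn)

print(f"accuracy: {accuracy:.2%}")          # looks excellent
print(f"fraud recall: {recall_fraud:.0%}")  # reveals the real problem
```

Because fraud makes up only 1% of the data, a model that misses 88 of 100 fraud cases still scores 98% accuracy, which is exactly why per-group recall must be checked separately.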