When we talk about responsible machine learning, the key metrics are fairness, accuracy, precision, and recall. Fairness asks whether the model treats all groups equitably, so that errors do not fall disproportionately on any one group. Accuracy is the fraction of all predictions that are correct. Precision is the fraction of predicted positives that are actually positive. Recall is the fraction of actual positives the model catches; it matters most when missing a positive case causes harm, as in medical diagnosis. Together, these metrics tell us whether a model is safe and fair enough to use in real life.
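The three counting-based metrics above can be written as small Python functions of the confusion-matrix counts. This is a minimal sketch (function names are illustrative; fairness is not a single formula and is omitted here):

```python
# Core metrics as plain functions of confusion-matrix counts:
# tp = true positives, fp = false positives,
# tn = true negatives, fn = false negatives.

def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Of everything predicted positive, the fraction that really is positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, the fraction the model caught."""
    return tp / (tp + fn)
```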
Responsible ML in Python: How Metrics Prevent Harm
                   Predicted Positive   Predicted Negative
Actual Positive            80                   20
Actual Negative            10                   90

Total samples = 200
True Positives (TP) = 80
False Positives (FP) = 10
True Negatives (TN) = 90
False Negatives (FN) = 20
This matrix helps us calculate precision, recall, and accuracy to understand model performance and potential harm.
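Plugging the counts from the matrix above into the standard formulas (a quick arithmetic check, not a library call):

```python
# Counts from the confusion matrix above.
tp, fp, tn, fn = 80, 10, 90, 20

accuracy = (tp + tn) / (tp + fp + tn + fn)   # 170/200 = 0.85
precision = tp / (tp + fp)                   # 80/90  ~ 0.889
recall = tp / (tp + fn)                      # 80/100 = 0.80

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

So this model is right 85% of the time overall, 89% of its "positive" calls are correct, but it still misses 20% of true positives.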
Imagine a spam email filter. High precision means most emails marked as spam really are spam, so you don't lose important emails. High recall means the filter catches most of the spam, but pushing recall higher usually lowers precision, so some good emails end up flagged as spam.
In medical tests, high recall is critical to catch all sick patients, even if some healthy people get extra tests (lower precision). Responsible ML balances these metrics to reduce harm depending on the situation.
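The trade-off described above can be seen by moving a decision threshold over classifier scores. This is a sketch with made-up scores and labels (the `pr_at` helper is hypothetical):

```python
# Made-up spam scores (higher = more spam-like) and true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def pr_at(threshold):
    """Precision and recall when flagging everything with score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

for t in (0.9, 0.5, 0.2):
    prec, rec = pr_at(t)
    print(f"threshold={t}: precision={prec:.2f} recall={rec:.2f}")
```

Lowering the threshold raises recall (more spam caught) while precision falls (more good mail flagged), which is exactly the balance a responsible deployment has to choose deliberately.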
- Good: High recall and precision, balanced fairness across groups, no bias in errors.
- Bad: High accuracy but low recall (missing many positive cases), unfair errors affecting certain groups more, biased predictions causing harm.
- Accuracy paradox: On imbalanced data, high overall accuracy can hide poor performance on the minority class or on minority groups.
- Data leakage: When test data leaks into training, metrics look better but model fails in real life.
- Overfitting: Model performs well on training data but poorly on new data, causing unexpected harm.
- Ignoring fairness: Good overall metrics but unfair treatment of some groups can cause harm.
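The accuracy paradox from the list above is easy to demonstrate: on imbalanced data, a model that always predicts "negative" looks accurate while catching nothing. The counts below are made up for illustration:

```python
# Imbalanced dataset: 10 positives, 990 negatives.
labels = [1] * 10 + [0] * 990
preds = [0] * 1000  # a useless model that always predicts "negative"

correct = sum(p == y for p, y in zip(preds, labels))
tp = sum(p and y for p, y in zip(preds, labels))
fn = sum((not p) and y for p, y in zip(preds, labels))

acc = correct / len(labels)   # 990/1000 = 99% accuracy...
rec = tp / (tp + fn)          # ...with 0% recall: every positive is missed
print(f"accuracy={acc:.2%} recall={rec:.2%}")
```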
Your model has 98% accuracy but only 12% recall on fraud cases. Is it good for production? Why or why not?
Answer: No, it is not good. Even though accuracy is high, the model misses 88% of fraud cases (low recall). This means many frauds go undetected, causing harm. For fraud detection, high recall is critical to catch as many frauds as possible.
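One set of made-up counts consistent with the quiz numbers (10,000 transactions, 200 frauds, and an assumed 24 false alarms) shows how 98% accuracy and 12% recall coexist:

```python
# Hypothetical counts matching the quiz: 98% accuracy, 12% recall on fraud.
total, frauds = 10_000, 200
tp = int(frauds * 0.12)        # 24 frauds caught
fn = frauds - tp               # 176 frauds missed
fp = 24                        # assumed false alarms
tn = total - tp - fn - fp      # 9,776 legitimate transactions correctly passed

acc = (tp + tn) / total
rec = tp / (tp + fn)
print(f"accuracy={acc:.2%} recall={rec:.2%} frauds missed={fn}/{frauds}")
```

Because legitimate transactions dominate, the 176 missed frauds barely dent accuracy, which is why recall, not accuracy, is the metric that exposes the harm here.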