Human approval workflows in Agentic AI - Model Metrics & Evaluation

In human approval workflows, the key metrics are precision and recall. Precision tells us how often the system's auto-approvals are actually correct, which keeps unsafe cases from slipping past review. Recall tells us how many of the genuinely safe cases the system auto-approves, which keeps humans from reviewing work the automation could have handled. Balancing the two keeps the workflow both safe and efficient.
| Actual \ Predicted | Approve | Review |
|--------------------|---------|--------|
| Approve            | TP      | FN     |
| Review             | FP      | TN     |
TP = Cases correctly auto-approved
FP = Cases incorrectly auto-approved (should have gone to review)
FN = Safe cases sent to review that could have been auto-approved
TN = Cases correctly sent to review
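As a quick sketch, both metrics fall directly out of these four counts. The counts below are invented for illustration, not taken from any real system:

```python
# Precision and recall for the "approve" (positive) class,
# computed from hypothetical confusion-matrix counts.
tp, fp, fn, tn = 90, 5, 10, 95

precision = tp / (tp + fp)  # of all auto-approvals, how many were truly safe
recall = tp / (tp + fn)     # of all safe cases, how many were auto-approved

print(f"precision = {precision:.3f}")  # 90/95  -> 0.947
print(f"recall    = {recall:.3f}")     # 90/100 -> 0.900
```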
High precision means almost every auto-approval is truly safe, so humans rarely have to undo a bad automatic decision. Low recall, however, means many cases that could have been auto-approved are routed to humans anyway, creating avoidable work.
High recall means the system auto-approves almost all of the genuinely safe cases, keeping the human workload down. Low precision, though, means some unsafe cases are auto-approved without review, which is where the real risk lies.
Example: In a loan approval system, high precision avoids wrongly approving risky loans automatically. High recall ensures most safe loans are approved without delay.
Good: Precision and recall both above 90%. This means the system auto-approves mostly correct cases and catches nearly all safe cases, balancing safety and efficiency.
Bad: Precision below 70% means many unsafe cases are auto-approved, risking errors. Recall below 50% means many safe cases are sent to humans unnecessarily, increasing workload.
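These bands could be encoded as a simple pre-deployment check. This is a minimal sketch: the function name `approval_gate` and the exact cutoffs (90% good; 70% precision / 50% recall bad) come from the rough guidance above, not from any standard library:

```python
# Classify a precision/recall pair against the rough quality bands above.
def approval_gate(precision: float, recall: float) -> str:
    if precision >= 0.90 and recall >= 0.90:
        return "good: safe and efficient"
    if precision < 0.70 or recall < 0.50:
        return "bad: unsafe approvals or wasted reviews"
    return "borderline: tune the decision threshold"

print(approval_gate(0.95, 0.92))  # good
print(approval_gate(0.65, 0.80))  # bad
```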
- Accuracy paradox: If most cases are safe, a model that always auto-approves can have high accuracy while never flagging a single risky case.
- Data leakage: Using future information in training can inflate metrics but fail in real use.
- Overfitting: Metrics look great on training data but drop on new cases, causing poor real-world performance.
- Ignoring class imbalance: If safe cases are rare, metrics must be carefully chosen to reflect true performance.
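The accuracy paradox is easy to reproduce with a toy dataset. The 95/5 class split and the "always approve" model below are invented purely for illustration:

```python
# A made-up imbalanced dataset: 95 safe cases, 5 risky ones.
actual = ["approve"] * 95 + ["review"] * 5
# A trivial model that auto-approves everything.
predicted = ["approve"] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
risky_caught = sum(a == "review" and p == "review"
                   for a, p in zip(actual, predicted))

print(accuracy)      # 0.95 -- looks great on paper
print(risky_caught)  # 0    -- every risky case slipped through
```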
Your human approval model has 98% accuracy but only 12% recall on safe cases. Is it ready for production? Why or why not?
Answer: No, it is not good. The low recall means the system misses most safe cases and sends them to humans unnecessarily, increasing workload despite high accuracy. This harms efficiency and defeats the purpose of automation.
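One set of made-up counts that reproduces this scenario exactly (10,000 cases, 225 of them safe) makes the mismatch concrete:

```python
# Hypothetical counts chosen to match 98% accuracy and 12% recall
# on the safe ("approve") class.
tp, fn = 27, 198   # only 27 of 225 safe cases auto-approved
fp, tn = 2, 9773   # nearly everything else correctly sent to review

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)

print(accuracy)  # 0.98 -- dominated by the huge review class
print(recall)    # 0.12 -- 198 safe cases still burden human reviewers
```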
