
Human approval workflows in Agentic AI - Model Metrics & Evaluation

Which metric matters for Human approval workflows and WHY

In human approval workflows, the key metrics are precision and recall, treating auto-approval as the positive class. Precision tells us how often the system's auto-approvals are actually safe, which protects against risky cases slipping past human review. Recall tells us how many of the truly safe cases the system auto-approves, which keeps unnecessary human reviews to a minimum. Balancing the two keeps the workflow both safe and efficient.

Confusion matrix for Human approval workflows
      |-----------|------------------|
      |           |     Predicted    |
      | Actual    | Approve | Review |
      |-----------|---------|--------|
      | Approve   |   TP    |   FN   |
      | Review    |   FP    |   TN   |
      |-----------|---------|--------|

      TP = Correctly auto-approved cases
      FP = Incorrectly auto-approved cases (should have been reviewed)
      FN = Safe cases sent for review that could have been auto-approved
      TN = Correctly sent for review
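The four cells above map directly to the two metrics. A minimal sketch with made-up counts (the numbers are purely illustrative, not from real data):

```python
# Hypothetical confusion-matrix counts for an approval workflow
tp = 850  # correctly auto-approved
fp = 30   # unsafe cases auto-approved (should have gone to review)
fn = 120  # safe cases sent to review unnecessarily
tn = 400  # correctly sent to review

precision = tp / (tp + fp)  # of all auto-approvals, how many were truly safe
recall = tp / (tp + fn)     # of all truly safe cases, how many were auto-approved

print(f"precision = {precision:.3f}")  # 0.966
print(f"recall    = {recall:.3f}")     # 0.876
```

Note that TN never appears in either formula: precision and recall deliberately ignore how many cases were correctly sent to review, which is exactly why they remain informative when one class dominates.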
    
Precision vs Recall tradeoff with examples

If the system has high precision, it means most auto-approvals are truly safe, so humans rarely need to fix mistakes. But if recall is low, many cases that could be auto-approved are sent to humans, causing extra work.

If recall is high, the system catches almost all cases that should be auto-approved, reducing human workload. But if precision is low, some unsafe cases slip through without review, risking errors.

Example: In a loan approval system, high precision avoids wrongly approving risky loans automatically. High recall ensures most safe loans are approved without delay.

What "good" vs "bad" metric values look like

Good: Precision and recall both above 90%. This means the system auto-approves mostly correct cases and catches nearly all safe cases, balancing safety and efficiency.

Bad: Precision below 70% means many unsafe cases are auto-approved, risking errors. Recall below 50% means many safe cases are sent to humans unnecessarily, increasing workload.

Common pitfalls in metrics for Human approval workflows
  • Accuracy paradox: If most cases genuinely need human review, a model that always sends everything to review can score high accuracy while never auto-approving anything, making the automation useless.
  • Data leakage: Using future information in training can inflate metrics but fail in real use.
  • Overfitting: Metrics look great on training data but drop on new cases, causing poor real-world performance.
  • Ignoring class imbalance: If safe cases are rare, metrics must be carefully chosen to reflect true performance.
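The accuracy paradox from the list above can be demonstrated in a few lines. A sketch with synthetic labels, where 95% of cases need review and a degenerate model always predicts "review":

```python
# Synthetic imbalanced data: safe (auto-approvable) cases are rare
labels = ["review"] * 95 + ["approve"] * 5
predictions = ["review"] * 100  # degenerate model: always send to review

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the "approve" class: fraction of safe cases actually auto-approved
tp = sum(1 for p, y in zip(predictions, labels) if p == "approve" and y == "approve")
recall = tp / labels.count("approve")

print(f"accuracy = {accuracy:.2f}")  # 0.95 - looks great on paper
print(f"recall   = {recall:.2f}")    # 0.00 - the model never automates anything
```

This is why the self-check below cannot be answered by looking at accuracy alone.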
Self-check question

Your human approval model has 98% accuracy but only 12% recall on safe cases. Is it good for production? Why not?

Answer: No, it is not good. The low recall means the system misses most safe cases and sends them to humans unnecessarily, increasing workload despite high accuracy. This harms efficiency and defeats the purpose of automation.

Key Result
Precision and recall are the key to balancing safety and efficiency in human approval workflows: precision guards against unsafe auto-approvals, recall keeps the human review queue small.