For complex tasks that require planning, task success rate and efficiency metrics matter most. Task success rate shows whether the plan leads to completing the task correctly. Efficiency metrics, such as time or number of steps taken, show whether the plan is practical rather than wasteful. Together they indicate whether the AI plans well enough to handle complexity.
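As a rough illustration of the two metrics named above, the sketch below computes a task success rate and one possible step-efficiency ratio. The `TaskRun` record, its fields, and the `min_steps` lower bound are assumptions made for this example; they do not come from any particular agent framework, and the run data is invented.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempted task. All fields are hypothetical, for illustration only."""
    succeeded: bool
    steps_taken: int
    min_steps: int  # assumed known lower bound on the steps the task needs


def task_success_rate(runs):
    """Fraction of runs that completed the task correctly."""
    if not runs:
        return 0.0
    return sum(r.succeeded for r in runs) / len(runs)


def step_efficiency(runs):
    """Minimum steps divided by steps actually taken, over successful runs only.

    1.0 means no wasted steps; lower values mean more waste.
    """
    done = [r for r in runs if r.succeeded]
    if not done:
        return 0.0
    return sum(r.min_steps for r in done) / sum(r.steps_taken for r in done)


# Invented sample data: three successes, one failure.
runs = [
    TaskRun(succeeded=True, steps_taken=5, min_steps=4),
    TaskRun(succeeded=True, steps_taken=8, min_steps=4),
    TaskRun(succeeded=False, steps_taken=12, min_steps=6),
    TaskRun(succeeded=True, steps_taken=3, min_steps=3),
]

print(task_success_rate(runs))  # 0.75
print(step_efficiency(runs))    # 0.6875
```

Step efficiency is measured only over successful runs here so that a fast failure does not count as an efficient plan; other definitions (wall-clock time, token cost) would work the same way.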
Why complex tasks need planning in Agentic AI - Why Metrics Matter
Task Outcome Confusion Matrix:
               | Planned Success | Planned Failure |
----------------------------------------------------
Actual Success | TP=80           | FN=20           |
Actual Failure | FP=10           | TN=90           |
Total tasks = 200
- TP: Tasks planned and succeeded
- FP: Tasks planned but failed
- FN: Tasks not planned but succeeded
- TN: Tasks not planned and failed
Precision = TP / (TP + FP) = 80 / (80 + 10) ≈ 0.89
Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84
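The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `precision_recall_f1` is made up for this example, and the zero-denominator fallback to 0.0 is one common convention, not the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Returns 0.0 for any metric whose denominator is zero (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the confusion matrix: TP=80, FP=10, FN=20.
p, r, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.89 recall=0.80 f1=0.84
```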
In planning complex tasks, precision is the share of planned tasks that actually succeed (TP out of TP + FP). Recall is the share of tasks the planner covers out of those it should have covered (TP out of TP + FN in the matrix above).
Example: A robot plans to clean rooms. High precision means when it plans a cleaning, it usually cleans well. High recall means it plans cleaning for almost all dirty rooms.
If precision is high but recall is low, the robot cleans well but misses many dirty rooms. If recall is high but precision is low, it tries to clean many rooms but often fails.
Good planning balances both to cover tasks well and succeed often.
Good planning metrics:
- Task success rate above 85%
- Precision and recall both above 80%
- Low number of unnecessary steps (high efficiency)
Bad planning metrics:
- Task success rate below 60%
- Precision or recall below 50%
- Plans that take too long or waste resources
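The thresholds listed above can be turned into a simple check. The sketch below is only one way to encode them: the function name `rate_planner` and the "borderline" label for results that fall between the two lists are assumptions added for this example, and efficiency is left out because the text gives no numeric cutoff for it.

```python
def rate_planner(success_rate, precision, recall):
    """Classify planner metrics using the rule-of-thumb thresholds from the text.

    All inputs are fractions in [0, 1]. "borderline" is a hypothetical label
    for anything that is neither clearly good nor clearly bad.
    """
    # Bad: success rate below 60%, or precision or recall below 50%.
    if success_rate < 0.60 or precision < 0.50 or recall < 0.50:
        return "bad"
    # Good: success rate above 85%, and precision and recall both above 80%.
    if success_rate > 0.85 and precision > 0.80 and recall > 0.80:
        return "good"
    return "borderline"


print(rate_planner(0.90, 0.89, 0.85))  # good
print(rate_planner(0.98, 0.90, 0.12))  # bad
print(rate_planner(0.75, 0.70, 0.70))  # borderline
```

The bad-case check runs first so that a single very weak metric cannot be masked by strong values elsewhere.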
Common pitfalls:
- Accuracy paradox: High overall success can hide poor planning on complex subtasks.
- Data leakage: If the AI sees future task info during planning, metrics look better but are unrealistic.
- Overfitting: Planning that works only on training tasks but fails on new ones.
- Ignoring efficiency: A plan that always succeeds but takes too long is not practical.
Your AI planner has 98% task success rate but only 12% recall on complex subtasks. Is it good for production? Why or why not?
Answer: No, it is not good. A recall of 12% means the planner misses roughly 88% of the complex subtasks that need planning. The high overall success rate hides this gap, so many important tasks are ignored, which can cause failures in real use.
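To see how a pooled number can hide a weak slice, the sketch below compares recall computed over all tasks with recall computed per slice. The counts are invented for illustration, chosen only so that the complex slice lands at 12% recall; they are not measurements from a real planner.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN), or 0.0 when there is nothing to recall."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts per task slice (not real data).
slices = {
    "simple": {"tp": 940, "fn": 10},
    "complex": {"tp": 6, "fn": 44},
}

# Recall within each slice.
per_slice = {name: recall(c["tp"], c["fn"]) for name, c in slices.items()}

# Recall over all tasks pooled together.
pooled = recall(
    sum(c["tp"] for c in slices.values()),
    sum(c["fn"] for c in slices.values()),
)

print(f"pooled recall:  {pooled:.2f}")                # 0.95
print(f"simple recall:  {per_slice['simple']:.2f}")   # 0.99
print(f"complex recall: {per_slice['complex']:.2f}")  # 0.12
```

Because simple tasks far outnumber complex ones in this made-up sample, the pooled figure looks healthy even though the planner covers only 6 of 50 complex subtasks. Reporting metrics per slice is one way to catch this.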