When decomposing a large task into subtasks, we need to measure how well each part contributes to the overall goal. Key metrics include per-subtask accuracy, completion rate, and error propagation; together they show whether the parts work well both in isolation and in combination. If a subtask has low accuracy, the final result suffers, so monitoring each step's performance is essential for improving the whole pipeline.
Task decomposition strategies in Agentic AI - Model Metrics & Evaluation
Subtask: classify images as cat or dog (confusion matrix, 200 samples)

                 Predicted Cat   Predicted Dog
True Cat               80              20
True Dog               15              85

TP (Cat) = 80, FP (Cat) = 15, FN (Cat) = 20, TN (Cat) = 85
From this matrix we can compute precision = TP / (TP + FP) and recall = TP / (TP + FN) for the subtask. Subtasks with high precision and recall pass fewer errors downstream, so mistakes don't compound.
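As a quick check, the precision and recall formulas applied to the matrix above can be computed directly:

```python
# Precision and recall for the "cat" class, using the counts from the
# confusion matrix above.
tp, fp, fn, tn = 80, 15, 20, 85

precision = tp / (tp + fp)  # 80 / 95
recall = tp / (tp + fn)     # 80 / 100

print(f"precision = {precision:.3f}")  # 0.842
print(f"recall    = {recall:.3f}")     # 0.800
```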
Imagine a task split into parts where one part catches most of the important items but also flags some wrong ones (high recall, low precision), while another part is rarely wrong but misses some items (high precision, low recall). Balancing the two depends on the goal: in a medical diagnosis task, missing a disease (low recall) is usually costlier than a false alarm (low precision), so each subtask should be tuned to the cost structure of the overall task.
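One common way to encode this trade-off is the F-beta score, where beta > 1 weights recall more heavily (the medical-diagnosis case) and beta < 1 favors precision. A minimal sketch with illustrative numbers, not taken from the text:

```python
# F-beta score: beta > 1 emphasizes recall, beta < 1 emphasizes precision.
def f_beta(precision: float, recall: float, beta: float) -> float:
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Subtask A: recall-heavy; Subtask B: precision-heavy (hypothetical values).
a = f_beta(precision=0.70, recall=0.95, beta=2.0)
b = f_beta(precision=0.95, recall=0.70, beta=2.0)
print(a > b)  # True: at beta=2, the recall-heavy subtask scores higher
```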
- Good: Subtasks with precision and recall above 90%, low error propagation, and consistent completion.
- Bad: Subtasks with precision or recall below 60%, causing many errors to pass on and reduce final output quality.
Good metrics mean the subtasks work well both individually and in combination; bad metrics expose weak links that drag down the whole pipeline.
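The "weak link" effect can be made concrete: if subtask errors are roughly independent, end-to-end accuracy is approximately the product of per-subtask accuracies, so one bad stage dominates. The numbers below are illustrative assumptions, not from the text:

```python
# End-to-end accuracy under independent subtask errors is roughly the
# product of per-subtask accuracies -- one weak link dominates.
def pipeline_accuracy(accs):
    acc = 1.0
    for a in accs:
        acc *= a
    return acc

good = [0.95] * 5              # five subtasks at 95% each
bad = [0.95] * 4 + [0.60]      # same pipeline with one 60% stage

print(f"{pipeline_accuracy(good):.3f}")  # 0.774
print(f"{pipeline_accuracy(bad):.3f}")   # 0.489
```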
- Ignoring error propagation: Small errors in subtasks can grow and ruin final results.
- Overfitting subtasks: Subtasks too tuned to training data may fail in real use.
- Data leakage: Subtasks accidentally use information unavailable at inference time, falsely inflating metrics.
- Accuracy paradox: High accuracy in subtasks with imbalanced data can be misleading.
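The accuracy paradox is easy to demonstrate with a synthetic 98/2 class split (hypothetical data): a "classifier" that always predicts the majority class looks accurate yet has zero recall on the minority class.

```python
# Accuracy paradox on imbalanced data: always predicting the majority
# class yields high accuracy but zero recall on the minority class.
labels = ["majority"] * 98 + ["minority"] * 2
preds = ["majority"] * 100  # degenerate classifier

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
minority_tp = sum(p == y == "minority" for p, y in zip(preds, labels))
minority_recall = minority_tp / labels.count("minority")

print(accuracy)         # 0.98 -- looks great
print(minority_recall)  # 0.0  -- useless on the class that matters
```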
Your task decomposition pipeline has 98% overall accuracy, but one subtask has only 12% recall on a critical class. Is it ready for production?
Answer: No. The low recall means the subtask misses many important cases. This will cause the whole system to fail on those cases, despite high overall accuracy. Improving recall in that subtask is crucial before production.
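The scenario above can be sketched with synthetic counts (hypothetical, chosen only to illustrate): when the critical class is rare, near-98% overall accuracy coexists with 12% recall on that class.

```python
# Synthetic counts: a rare critical class lets near-98% overall accuracy
# hide 12% recall on the cases that matter most.
n_total, n_critical = 1000, 25
caught = 3                            # 3 / 25 = 12% recall on the critical class
other_correct = n_total - n_critical  # assume every non-critical sample is correct

accuracy = (caught + other_correct) / n_total
recall = caught / n_critical

print(f"overall accuracy = {accuracy:.1%}")  # 97.8% -- looks near-perfect
print(f"critical recall  = {recall:.0%}")    # 12% -- 22 of 25 critical cases missed
```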