Chain-of-Thought Reasoning in Agentic AI Agents: Model Metrics & Evaluation
For chain-of-thought reasoning in agents, accuracy and step-wise correctness are the key metrics. Accuracy tells us how often the agent's final answer is right; step-wise correctness checks whether each individual reasoning step is logically sound. This matters because chain-of-thought breaks a problem into steps, so an error in an early step can propagate into a wrong final answer. Measuring both tells us whether the agent actually reasons well or merely guesses.
Final Answer Confusion Matrix (example, 200 samples):

|                | Predicted Correct | Predicted Wrong |
|----------------|-------------------|-----------------|
| Actual Correct | 85                | 15              |
| Actual Wrong   | 10                | 90              |
Step-wise correctness can be shown as:
Steps Correct: 400 out of 500 steps (80%)
Steps Incorrect: 100 out of 500 steps (20%)
In chain-of-thought agents, precision is the fraction of answers the agent labels as correct that truly are correct. Recall is the fraction of all truly correct answers that the agent actually finds.
Example: If an agent is very cautious and only answers when very sure, it may have high precision (few wrong answers) but low recall (misses many correct answers).
Why it matters: For a tutoring agent, high precision is important to avoid confusing learners with wrong answers. For a brainstorming agent, high recall is better to explore many ideas, even if some are wrong.
Good metrics:
- Final answer accuracy above 85%
- Step-wise correctness above 80%
- Balanced precision and recall (both above 80%)
Bad metrics:
- Final answer accuracy below 60%
- Step-wise correctness below 50%
- Very high precision but very low recall (or vice versa), showing poor tradeoff
Common pitfalls:
- Ignoring step-wise errors: checking only final-answer accuracy misses cases where the agent's reasoning is flawed but it guesses the right answer.
- Data leakage: Training on test problems can inflate accuracy falsely.
- Overfitting: Agent memorizes answers instead of reasoning, showing high accuracy on training but low on new problems.
- Accuracy paradox: High accuracy on easy problems may hide poor reasoning on hard ones.
Your chain-of-thought agent has 98% final answer accuracy but only 12% step-wise correctness. Is it good for production? Why or why not?
Answer: No, it is not good. The low step-wise correctness means the agent's reasoning steps are mostly wrong, even if the final answers seem right. This suggests the agent guesses or shortcuts reasoning, which can fail on new or complex problems. Reliable chain-of-thought agents need both high final accuracy and high step-wise correctness.