Multi-step reasoning in Prompt Engineering / GenAI - Model Metrics & Evaluation

Multi-step reasoning tasks require the model to follow a chain of logical steps correctly to reach the right answer. Accuracy measures how often the model gets the full reasoning chain right. However, since errors can occur at any step, precision and recall on intermediate reasoning steps or sub-tasks help pinpoint where mistakes happen. In short, accuracy tells us whether the model solves the whole problem correctly, while precision and recall help diagnose partial errors.
|                    | Predicted Correct  | Predicted Incorrect |
|--------------------|--------------------|---------------------|
| Actually Correct   | True Positive (TP) | False Negative (FN) |
| Actually Incorrect | False Positive (FP)| True Negative (TN)  |
TP: Model correctly completes all reasoning steps.
FN: Model's reasoning is actually correct, but it is judged incorrect.
FP and TN are less common but can represent partial step correctness in some setups.
Example counts:
TP = 80, FN = 20, FP = 5, TN = 95
Total samples = 200
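From these counts, the standard metrics follow directly. A minimal sketch in Python (the function name is my own):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute standard classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of steps judged correct, how many truly were
    recall = tp / (tp + fn)      # of truly correct steps, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = confusion_metrics(tp=80, fn=20, fp=5, tn=95)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# accuracy=0.875 precision=0.941 recall=0.800 f1=0.865
```

With the example counts, accuracy is (80 + 95) / 200 = 0.875, precision is 80 / 85 ≈ 0.941, and recall is 80 / 100 = 0.80.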
Imagine a model that tries to solve math word problems step-by-step.
- High precision means when the model says a step is correct, it usually is. This avoids false positives but might miss some correct steps.
- High recall means the model finds most of the correct steps, but might also include some wrong ones.
For multi-step reasoning, high recall is important for catching all the correct steps, while high precision ensures the reasoning that is reported is reliable. Balancing both with the F1 score gives a single measure of overall step correctness.
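One way to make step-level precision and recall concrete is to score a model's predicted steps against a set of reference steps. A minimal sketch, assuming steps can be matched by exact string comparison (real evaluations often need fuzzy or semantic matching):

```python
def step_precision_recall(predicted_steps, reference_steps):
    """Step-level precision and recall via exact set matching (a simplification)."""
    pred, ref = set(predicted_steps), set(reference_steps)
    matched = pred & ref
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical math word problem: 3 reference steps; the model predicts 4,
# including one spurious step, so precision drops while recall stays perfect.
reference = ["parse quantities", "set up equation", "solve for x"]
predicted = ["parse quantities", "set up equation", "solve for x", "double the result"]
p, r = step_precision_recall(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=1.00
```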
Good: Accuracy above 85% means the model solves most problems completely correctly. Precision and recall above 80% on reasoning steps indicate reliable and complete logic.
Bad: Accuracy below 50% means the model often fails to complete the reasoning. Precision or recall below 50% on steps means many erroneous or missed logic steps, making the model unreliable.
- Accuracy paradox: High accuracy can be misleading if the dataset has many easy problems and few hard ones.
- Data leakage: If the model sees answers during training, metrics will be unrealistically high.
- Overfitting: Model performs well on training but poorly on new problems, showing low generalization.
- Ignoring intermediate steps: Only checking final answer misses errors in reasoning steps.
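The accuracy-paradox pitfall above can be shown numerically: on an imbalanced evaluation set, a trivial baseline that always outputs the majority answer looks strong on accuracy while catching none of the hard cases. A toy sketch with made-up numbers:

```python
# Toy imbalanced set: 95 easy problems (label 1) and 5 hard ones (label 0).
labels      = [1] * 95 + [0] * 5
predictions = [1] * 100  # trivial baseline: always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the hard class: how many hard problems did the baseline identify?
hard_recall = sum(p == 0 and y == 0 for p, y in zip(predictions, labels)) / 5

print(f"accuracy={accuracy:.2f} hard-case recall={hard_recall:.2f}")
# accuracy=0.95 hard-case recall=0.00
```

The 95% accuracy here says nothing about reasoning quality on the hard problems, which is exactly why step-level precision and recall are worth tracking alongside it.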
Your multi-step reasoning model has 98% accuracy but only 12% recall on intermediate reasoning steps. Is it good for production? Why or why not?
Answer: No. The high accuracy means it often gets the final answer right, but the very low recall on intermediate steps means it misses most of the correct reasoning. This suggests the model may be guessing or shortcutting its reasoning, which can fail on harder problems and reduces trust in its explanations.