## Chain-of-Thought Prompting: Model Metrics & Evaluation

Chain-of-thought prompting asks a model to work through a problem step by step before giving its final answer. The primary metric to check is final-answer accuracy, because it shows whether the reasoning actually leads to correct results. Explanation quality also matters, but measuring it typically requires human review or text-similarity metrics such as BLEU or ROUGE. For a simple first pass, focus on accuracy to see whether chain-of-thought helps the model solve problems better.
|                  | Predicted Correct  | Predicted Wrong     |
|------------------|--------------------|---------------------|
| Sound reasoning  | True Positive (TP) | False Negative (FN) |
| Flawed reasoning | False Positive (FP)| True Negative (TN)  |
Example:
- TP = 80 (model reasoned correctly and answered right)
- FN = 10 (model reasoned soundly but answered wrong)
- FP = 5 (model guessed right without good reasoning)
- TN = 5 (model guessed wrong without good reasoning)
- Total samples = 100
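A minimal sketch in Python of turning the counts from the worked example above into final-answer accuracy (note that under this table's framing, both TP and FP cases end in a correct final answer):

```python
# Confusion-matrix counts from the worked example above.
# Rows in the table: sound vs. flawed reasoning; columns: answer correct vs. wrong.
tp, fn, fp, tn = 80, 10, 5, 5
total = tp + fn + fp + tn  # 100 samples

# Final-answer accuracy: TP and FP both end in a correct answer,
# with or without sound reasoning behind it.
answer_accuracy = (tp + fp) / total
print(f"final-answer accuracy: {answer_accuracy:.2f}")  # → 0.85
```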
Precision here means: of the correct answers the model produces, how often sound reasoning led to them, i.e. TP / (TP + FP). Recall means: of all the cases where the model reasoned soundly, how many ended in a correct answer, i.e. TP / (TP + FN).
For example, if you want the model to only give answers when it is sure (high precision), it might skip some correct answers (lower recall). If you want the model to find as many correct answers as possible (high recall), it might sometimes give wrong answers (lower precision).
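Using the standard formulas with the same counts as the example above (mapping them onto the reasoning/answer framing is an interpretation, not a fixed convention):

```python
# Same counts as the worked example above.
tp, fn, fp = 80, 10, 5

# Precision: of all correct answers, how many were backed by sound reasoning.
precision = tp / (tp + fp)   # 80 / 85
# Recall: of all soundly-reasoned attempts, how many ended in a correct answer.
recall = tp / (tp + fn)      # 80 / 90
print(f"precision={precision:.3f} recall={recall:.3f}")  # → precision=0.941 recall=0.889
```

Raising one of these numbers typically lowers the other, which is the trade-off described above.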
In tutoring or exams, high precision is important to trust the model's explanations. In brainstorming or idea generation, high recall might be better to explore many possibilities.
- Good: Accuracy above 85% with clear, logical explanations. Precision and recall both above 80% means the model reasons well and answers correctly most of the time.
- Bad: Accuracy below 60% or explanations that do not match the answer. Precision or recall below 50% means the model either guesses too much or misses many correct answers despite reasoning.
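The good/bad cutoffs above can be sketched as a simple quality gate (the thresholds are the rules of thumb stated here, not an industry standard, and the middle ground is labeled "borderline" as an assumption):

```python
def evaluate(accuracy: float, precision: float, recall: float) -> str:
    """Rough quality gate using the rule-of-thumb cutoffs above."""
    if accuracy > 0.85 and precision > 0.80 and recall > 0.80:
        return "good"
    if accuracy < 0.60 or precision < 0.50 or recall < 0.50:
        return "bad"
    return "borderline"

print(evaluate(0.90, 0.94, 0.89))  # → good
print(evaluate(0.98, 0.95, 0.12))  # → bad (high accuracy cannot rescue very low recall)
```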
- Accuracy paradox: high accuracy can mask poor reasoning quality when the task is easy or the dataset is biased.
- Data leakage: Model might memorize answers instead of reasoning, inflating metrics.
- Overfitting: Model performs well on training prompts but poorly on new problems.
- Ignoring explanation quality: Only checking final answer misses if reasoning is flawed or nonsensical.
Question: Your chain-of-thought model has 98% accuracy but only 12% recall on correct reasoning steps. Is it good for production? Why or why not?
Answer: No, it is not good. Although accuracy is high, the model rarely finds correct reasoning steps (low recall). This means it often guesses right without proper reasoning, which reduces trust and usefulness in tasks needing explanations.