
Chain-of-thought prompting in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Chain-of-thought prompting
Which metric matters for Chain-of-thought prompting and WHY

Chain-of-thought prompting asks an AI model to lay out its reasoning step by step before it answers. The key metric to check is the accuracy of the final answer, because it shows whether the reasoning leads to the correct result. Explanation quality also matters, but it is usually judged by human review. Text-similarity metrics such as BLEU or ROUGE only measure word overlap with a reference explanation, not whether the reasoning is sound. For a simple evaluation, compare final-answer accuracy with and without chain-of-thought to see whether it helps the model solve problems better.
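A minimal sketch of final-answer accuracy, assuming answers are compared by exact match after trimming whitespace and lowercasing. The function name `exact_match_accuracy` and the sample answers are hypothetical and only illustrate the comparison between a direct prompt and a chain-of-thought prompt.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference after light normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        return text.strip().lower()

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical final answers from two prompt styles on the same four questions.
references = ["5", "6", "60", "12"]
direct_answers = ["5", "7", "50", "12"]
cot_answers = ["5", "6", "60", "11"]

print(exact_match_accuracy(direct_answers, references))  # 0.5
print(exact_match_accuracy(cot_answers, references))     # 0.75
```

Exact match is the simplest choice. Tasks with free-form answers may need a looser comparison.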

Confusion matrix example for Chain-of-thought prompting
      |                    | Answer right        | Answer wrong        |
      |--------------------|---------------------|---------------------|
      | Sound reasoning    | True Positive (TP)  | False Negative (FN) |
      | No sound reasoning | False Positive (FP) | True Negative (TN)  |

      Example:
      TP = 80 (model reasoned correctly and answered right)
      FN = 10 (model reasoned but answered wrong)
      FP = 5  (model guessed right without good reasoning)
      TN = 5  (model guessed wrong without reasoning)

      Total samples = 100
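As a sketch, the standard accuracy, precision, recall, and F1 formulas can be applied to the example counts above. Treating the four cells with the textbook formulas is an assumption here, since the example does not state which formulas it intends.

```python
# Counts taken from the example above.
tp, fn, fp, tn = 80, 10, 5, 5
total = tp + fn + fp + tn

accuracy = (tp + tn) / total   # (80 + 5) / 100
precision = tp / (tp + fp)     # 80 / 85
recall = tp / (tp + fn)        # 80 / 90
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")              # total=100
print(f"accuracy={accuracy:.3f}")    # accuracy=0.850
print(f"precision={precision:.3f}")  # precision=0.941
print(f"recall={recall:.3f}")        # recall=0.889
print(f"f1={f1:.3f}")                # f1=0.914
```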
    
Precision vs Recall tradeoff in Chain-of-thought prompting

Precision here means how often the model's reasoning leads to a correct answer when it claims to be confident. Recall means how many of all correct answers the model finds with good reasoning.

For example, if you want the model to only give answers when it is sure (high precision), it might skip some correct answers (lower recall). If you want the model to find as many correct answers as possible (high recall), it might sometimes give wrong answers (lower precision).

In tutoring or exams, high precision is important to trust the model's explanations. In brainstorming or idea generation, high recall might be better to explore many possibilities.
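The tradeoff can be illustrated with a small sketch in which the model only answers when its confidence reaches a threshold. The confidence scores, the `records` data, and the function `precision_recall_at` are all hypothetical. Real models do not necessarily expose a calibrated confidence value.

```python
def precision_recall_at(records, threshold):
    """Precision and recall when the model answers only at or above the threshold.

    records: list of (confidence, is_correct) pairs, one per question.
    """
    answered = [ok for conf, ok in records if conf >= threshold]
    total_correct = sum(ok for _, ok in records)
    kept_correct = sum(answered)
    # Convention chosen for this sketch: precision is 0.0 if nothing is answered.
    precision = kept_correct / len(answered) if answered else 0.0
    recall = kept_correct / total_correct if total_correct else 0.0
    return precision, recall


# Hypothetical (confidence, answer was correct) pairs for ten questions.
records = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False), (0.70, True),
    (0.60, True), (0.55, False), (0.40, True), (0.30, False), (0.20, False),
]

for threshold in (0.85, 0.5, 0.0):
    p, r = precision_recall_at(records, threshold)
    print(f"threshold={threshold:.2f} precision={p:.3f} recall={r:.3f}")
# threshold=0.85 precision=1.000 recall=0.500
# threshold=0.50 precision=0.714 recall=0.833
# threshold=0.00 precision=0.600 recall=1.000
```

Raising the threshold keeps only the answers the model is most sure of, so precision rises while recall falls.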

What good vs bad metric values look like for Chain-of-thought prompting
  • Good: Accuracy above 85% with clear, logical explanations. Precision and recall both above 80% indicate that the model reasons well and answers correctly most of the time.
  • Bad: Accuracy below 60% or explanations that do not match the answer. Precision or recall below 50% means the model either guesses too much or misses many correct answers despite reasoning.
Common pitfalls in evaluating Chain-of-thought prompting
  • Accuracy paradox: High accuracy but poor reasoning quality if the task is easy or biased.
  • Data leakage: Model might memorize answers instead of reasoning, inflating metrics.
  • Overfitting: Model performs well on training prompts but poorly on new problems.
  • Ignoring explanation quality: Only checking final answer misses if reasoning is flawed or nonsensical.
Self-check question

Your chain-of-thought model has 98% accuracy but only 12% recall on correct reasoning steps. Is it good for production? Why or why not?

Answer: No, it is not good. Although accuracy is high, the model rarely finds correct reasoning steps (low recall). This means it often guesses right without proper reasoning, which reduces trust and usefulness in tasks needing explanations.

Key Result
Accuracy is key to check if chain-of-thought prompting improves correct answers; precision and recall reveal reasoning quality tradeoffs.

Practice

1. What is the main purpose of chain-of-thought prompting in AI?
easy
A. To increase the randomness of AI answers
B. To make AI respond faster
C. To reduce the size of the AI model
D. To help AI explain its reasoning step-by-step

Solution

  1. Step 1: Understand the concept of chain-of-thought prompting

    It is designed to guide AI to explain its reasoning in steps rather than giving a direct answer.
  2. Step 2: Identify the main goal

    The goal is to improve clarity and accuracy by showing the reasoning process.
  3. Final Answer:

    To help AI explain its reasoning step-by-step -> Option D
  4. Quick Check:

    Chain-of-thought = step-by-step reasoning [OK]
Hint: Think: Does it explain or just answer? Explanation means chain-of-thought [OK]
Common Mistakes:
  • Confusing speed with reasoning clarity
  • Thinking it reduces model size
  • Assuming it increases randomness
2. Which of the following is the correct way to start a chain-of-thought prompt?
easy
A. "Let's think step-by-step."
B. "Answer quickly without explanation."
C. "Give me a random fact."
D. "Explain your answer in one word."

Solution

  1. Step 1: Identify phrases that encourage stepwise reasoning

    "Let's think step-by-step" clearly asks for a stepwise explanation.
  2. Step 2: Eliminate options that do not prompt reasoning

    The options asking for a one-word answer, quick response without explanation, or a random fact do not encourage detailed reasoning.
  3. Final Answer:

    "Let's think step-by-step." -> Option A
  4. Quick Check:

    Prompt that guides stepwise thinking = "Let's think step-by-step." [OK]
Hint: Look for prompts that say 'step-by-step' or 'explain' [OK]
Common Mistakes:
  • Choosing prompts that ask for short or random answers
  • Ignoring the need for explanation
  • Confusing speed with reasoning
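A minimal sketch of building such a prompt in code. The helper name `make_cot_prompt` is hypothetical, and the resulting string would be passed to whatever model client you use.

```python
def make_cot_prompt(question, trigger="Let's think step-by-step."):
    """Append a chain-of-thought trigger phrase to a question."""
    return f"{question.strip()} {trigger}"


prompt = make_cot_prompt("If you have 3 apples and get 2 more, how many apples do you have?")
print(prompt)
# If you have 3 apples and get 2 more, how many apples do you have? Let's think step-by-step.
```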
3. Given this prompt: "If you have 3 apples and get 2 more, how many apples do you have? Let's think step-by-step."
What is the expected output from the AI?
medium
A. "You have 2 apples."
B. "5 apples"
C. "3 plus 2 equals 5, so you have 5 apples."
D. "Apples are fruits."

Solution

  1. Step 1: Understand the prompt asks for step-by-step reasoning

    The phrase "Let's think step-by-step" asks the AI to explain the calculation process.
  2. Step 2: Identify the output that shows reasoning

    "3 plus 2 equals 5, so you have 5 apples." explains the addition step and then gives the answer, matching the prompt's request.
  3. Final Answer:

    "3 plus 2 equals 5, so you have 5 apples." -> Option C
  4. Quick Check:

    Step-by-step prompt = explanation + answer [OK]
Hint: Look for answers that explain before concluding [OK]
Common Mistakes:
  • Choosing only the final answer without explanation
  • Picking unrelated or incomplete answers
  • Ignoring the step-by-step request
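To score final-answer accuracy on outputs like the one in question 3, the answer first has to be pulled out of the explanation. A simple sketch, assuming the final answer is the last number mentioned in the response. This is a common heuristic rather than a guaranteed rule, and `extract_final_number` is a hypothetical helper.

```python
import re


def extract_final_number(text):
    """Return the last number mentioned in the text as a string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


output = "3 plus 2 equals 5, so you have 5 apples."
print(extract_final_number(output))                # 5
print(extract_final_number("Apples are fruits."))  # None
```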
4. You wrote this prompt: "Calculate 10 minus 4. Think step-by-step."
The AI responds with just "6". What is the likely problem?
medium
A. The question is too hard for AI
B. The prompt does not clearly ask for step-by-step explanation
C. The AI model is broken
D. The prompt is too long

Solution

  1. Step 1: Analyze the prompt wording

    The prompt says "Think step-by-step" but never asks the AI to write that reasoning out in its reply.
  2. Step 2: Understand AI needs clear instructions

    Without an explicit request such as "Let's think step-by-step and show each step," the AI may skip the explanation and give a direct answer.
  3. Final Answer:

    The prompt does not clearly ask for step-by-step explanation -> Option B
  4. Quick Check:

    Clear prompt needed for explanation [OK]
Hint: State explicitly that you want the steps shown, for example 'Let's think step-by-step' [OK]
Common Mistakes:
  • Blaming AI model instead of prompt clarity
  • Assuming question difficulty causes no explanation
  • Thinking prompt length affects reasoning
5. You want the AI to solve this complex problem: "If a train travels 60 miles in 1 hour and then 90 miles in 1.5 hours, what is the average speed? Let's think step-by-step."
Which chain-of-thought prompt addition will best improve the AI's accuracy?
hard
A. "Explain each calculation clearly before giving the final answer."
B. "Just give the final number quickly."
C. "Ignore the question and talk about trains."
D. "Answer with a random speed value."

Solution

  1. Step 1: Recognize the need for detailed reasoning in complex problems

    Complex problems benefit from clear stepwise explanations to avoid mistakes.
  2. Step 2: Identify the prompt that encourages detailed calculation explanation

    "Explain each calculation clearly before giving the final answer." explicitly asks for clear explanation before the final answer, improving accuracy.
  3. Final Answer:

    "Explain each calculation clearly before giving the final answer." -> Option A
  4. Quick Check:

    Detailed explanation improves accuracy [OK]
Hint: Ask AI to explain calculations clearly for complex problems [OK]
Common Mistakes:
  • Choosing prompts that skip explanation
  • Ignoring the problem complexity
  • Selecting irrelevant or random answer options
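For reference, the arithmetic behind the train problem in question 5 can be sketched as follows. Average speed is total distance divided by total time. The variable names are illustrative.

```python
# (distance in miles, time in hours) for each leg of the trip in question 5.
legs = [(60, 1.0), (90, 1.5)]

total_distance = sum(distance for distance, _ in legs)  # 60 + 90 = 150 miles
total_time = sum(hours for _, hours in legs)            # 1.0 + 1.5 = 2.5 hours
average_speed = total_distance / total_time             # 150 / 2.5

print(average_speed)  # 60.0
```

A chain-of-thought answer that spells out these three steps is easy to verify against this calculation.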