Plan-and-execute pattern in Agentic AI - Model Metrics & Evaluation

Metrics & Evaluation - Plan-and-execute pattern
Which metric matters for the Plan-and-execute pattern and WHY

The Plan-and-execute pattern involves an AI agent first creating a plan and then carrying it out. To evaluate this, we focus on task success rate and execution accuracy. Task success rate tells us if the agent completed the goal correctly. Execution accuracy measures how well the agent followed the plan steps. These metrics matter because a good plan is useless if not executed well, and good execution without a good plan may fail the goal.
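
As a rough illustration of the two metrics above, the sketch below computes them from a small, made-up run log. The `Run` record, its field names, and the sample numbers are assumptions for illustration only. Execution accuracy is taken here to mean the fraction of planned steps that were carried out as planned, which is one possible reading of the definition.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent run (hypothetical record, not tied to any framework)."""
    planned_steps: int     # number of steps in the plan
    steps_as_planned: int  # steps that were executed as the plan specified
    goal_met: bool         # whether the final goal was achieved


def task_success_rate(runs):
    """Fraction of runs whose goal was achieved."""
    return sum(r.goal_met for r in runs) / len(runs)


def execution_accuracy(runs):
    """Fraction of all planned steps that were executed as planned."""
    planned = sum(r.planned_steps for r in runs)
    followed = sum(r.steps_as_planned for r in runs)
    return followed / planned


# Made-up sample data
runs = [
    Run(planned_steps=4, steps_as_planned=4, goal_met=True),
    Run(planned_steps=5, steps_as_planned=3, goal_met=False),
    Run(planned_steps=3, steps_as_planned=3, goal_met=False),  # plan followed, goal still missed
    Run(planned_steps=4, steps_as_planned=4, goal_met=True),
]

print(f"task success rate:  {task_success_rate(runs):.3f}")
print(f"execution accuracy: {execution_accuracy(runs):.3f}")
```

The third run follows its plan perfectly and still misses the goal, which is why the two numbers are tracked separately.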

Confusion matrix or equivalent visualization
Task Outcome Confusion Matrix:

                Predicted Success   Predicted Failure
Actual Success       TP = 85            FN = 15
Actual Failure       FP = 10            TN = 90

Total samples = 200

- TP (True Positive): Agent reported success, and the task actually succeeded.
- FP (False Positive): Agent reported success, but the task actually failed.
- FN (False Negative): Agent reported failure (or aborted), but the task had actually succeeded.
- TN (True Negative): Agent reported failure or aborted, and the task had indeed failed.

From this:
- Precision = TP / (TP + FP) = 85 / (85 + 10) = 0.895
- Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85
- F1 Score = 2 * (0.895 * 0.85) / (0.895 + 0.85) ≈ 0.872
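
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `precision_recall_f1` is an assumption, and the counts are the ones from the matrix. The sketch does not guard against empty denominators.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the matrix above
tp, fn, fp, tn = 85, 15, 10, 90

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
print(f"accuracy  = {accuracy:.3f}")
```

Accuracy is not listed above, but it falls out of the same four counts as (85 + 90) / 200.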
    
Precision vs Recall tradeoff with concrete examples

In plan-and-execute, precision means when the agent says it succeeded, it really did. High precision avoids false success claims, important in safety-critical tasks like robot surgery.

Recall means that when a task really did succeed, the agent recognizes it and reports success. High recall keeps the agent from discarding, retrying, or escalating work that was already done correctly, which matters for customer support bots that should close every query they have actually solved.

For example, a delivery robot with high precision but low recall almost never claims a false success, but it fails to confirm many deliveries it actually completed, which triggers needless re-deliveries. A robot with high recall but low precision confirms nearly every real delivery but also claims success on some that failed, causing trust issues.

What "good" vs "bad" metric values look like for this use case

Good metrics: Precision and recall above 85% show the agent reliably plans and executes tasks correctly and reports success accurately.

Bad metrics: Precision below 70% means many false success claims, risking trust. Recall below 60% means many missed successful executions, reducing usefulness.

Also, a large gap between precision and recall indicates imbalance: either the agent is too cautious or too optimistic.
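
These rules of thumb could be encoded as a small check. In the sketch below, the 0.85, 0.70, and 0.60 cutoffs come from the text; the function name `assess` and the `max_gap` value of 0.15 are assumptions, since the text does not say how large a "large gap" is.

```python
def assess(precision, recall, max_gap=0.15):
    """Return human-readable notes for a precision/recall pair.

    max_gap is a hypothetical threshold for "large gap"; tune it per use case.
    """
    notes = []
    if precision > 0.85 and recall > 0.85:
        notes.append("good: success reports are reliable and complete")
    if precision < 0.70:
        notes.append("bad precision: many false success claims")
    if recall < 0.60:
        notes.append("bad recall: many real successes are missed")
    if abs(precision - recall) > max_gap:
        # High precision with low recall: rarely claims success (cautious).
        # High recall with low precision: claims success too often (optimistic).
        side = "too cautious" if precision > recall else "too optimistic"
        notes.append(f"imbalanced: agent looks {side}")
    return notes or ["borderline: between the good and bad ranges"]


print(assess(0.90, 0.90))
print(assess(0.95, 0.50))
print(assess(0.65, 0.92))
```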

Common pitfalls in metrics for Plan-and-execute pattern
  • Accuracy paradox: High overall accuracy can hide poor execution if most tasks are easy or fail by default.
  • Data leakage: If the agent sees test tasks during training, metrics will be unrealistically high.
  • Overfitting: Agent may memorize plans for training tasks but fail on new ones, causing a low task success rate on unseen tasks.
  • Ignoring execution errors: Only measuring plan quality without execution accuracy misses real-world failures.
Self-check question

Your plan-and-execute agent has 98% accuracy but only 12% recall on successful task completion. Is it good for production? Why or why not?

Answer: No, it is not good. The high accuracy most likely comes from a test set dominated by failed tasks that the agent correctly reports as failures. The very low recall means the agent recognizes only 12% of the tasks that actually succeeded, so it cannot reliably tell when its work is done. Accuracy hides this because successes are rare, which makes it the wrong headline metric here.
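
To make the numbers in the question concrete, here is one hypothetical set of counts that produces exactly 98% accuracy and 12% recall. The counts are invented for illustration and are not taken from any real evaluation.

```python
# Hypothetical counts: 10,000 evaluated tasks, only 200 of which truly succeeded
tp = 24     # reported success, actually succeeded
fn = 176    # reported failure, actually succeeded
fp = 24     # reported success, actually failed
tn = 9776   # reported failure, actually failed

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"total     = {total}")
print(f"accuracy  = {accuracy:.2f}")
print(f"recall    = {recall:.2f}")
print(f"precision = {precision:.2f}")
```

Because 9,800 of the 10,000 tasks are true failures, the large TN count dominates accuracy while the success class is handled poorly.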

Key Result
Task success rate and execution accuracy, together with the precision and recall of the agent's success reports, are key to evaluating plan-and-execute agents effectively.

Practice

1. What is the main idea behind the plan-and-execute pattern in agentic AI?
easy
A. Execute the whole task at once without planning
B. Randomly try different actions until one works
C. Break a big task into smaller steps and do them one by one
D. Only plan without doing any steps

Solution

  1. Step 1: Understand the pattern purpose

    The plan-and-execute pattern is designed to handle big tasks by dividing them into smaller, manageable steps.
  2. Step 2: Match the description to options

    Option C, "Break a big task into smaller steps and do them one by one", describes exactly this: plan the steps first, then execute them in order.
  3. Final Answer:

    Break a big task into smaller steps and do them one by one -> Option C
  4. Quick Check:

    Plan-and-execute = break big task into steps [OK]
Hint: Think: big task needs small steps first [OK]
Common Mistakes:
  • Confusing planning with skipping execution
  • Thinking AI acts randomly without plan
  • Believing the task is done all at once
2. Which of these code snippets correctly shows the start of a plan-and-execute loop in Python?
easy
A. execute(plan) for step in plan:
B. while plan: execute(plan)
C. if plan: execute(plan)
D. for step in plan: execute(step)

Solution

  1. Step 1: Identify correct loop structure for steps

    The plan is a list of steps, so we loop over each step with for step in plan:.
  2. Step 2: Check execution inside loop

    Inside the loop, each step is executed with execute(step), matching the pattern.
  3. Final Answer:

    for step in plan: execute(step) -> Option D
  4. Quick Check:

    Loop over steps then execute each [OK]
Hint: Loop over plan steps, then execute each [OK]
Common Mistakes:
  • Using while without updating plan
  • Executing whole plan at once
  • Incorrect loop syntax or order
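
For context, the loop from Option D can be placed in a complete, runnable sketch. The `make_plan` and `execute` functions below are stand-ins invented for illustration; a real agent would typically call a model to produce the plan and tools to run each step.

```python
def make_plan(goal):
    """Stand-in planner: returns a fixed list of steps for the goal."""
    return [f"research {goal}", f"draft {goal}", f"review {goal}"]


def execute(step):
    """Stand-in executor: pretends to run one step and returns its result."""
    return f"done {step}"


def plan_and_execute(goal):
    plan = make_plan(goal)            # 1. plan first
    results = []
    for step in plan:                 # 2. then execute each step in order
        results.append(execute(step))
    return plan, results


plan, results = plan_and_execute("report")
print(plan)
print(results)
```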
3. Given this code snippet using plan-and-execute pattern:
plan = ['step1', 'step2', 'step3']
results = []
for step in plan:
    results.append(f"done {step}")
print(results)

What is the output?
medium
A. ['step1', 'step2', 'step3']
B. ['done step1', 'done step2', 'done step3']
C. ['done step3']
D. Error: append not defined

Solution

  1. Step 1: Understand the loop and append

    The loop goes through each step in plan and appends the string 'done ' plus the step name to results.
  2. Step 2: Trace the results list after loop

    After all steps, results contains ['done step1', 'done step2', 'done step3'].
  3. Final Answer:

    ['done step1', 'done step2', 'done step3'] -> Option B
  4. Quick Check:

    Each step marked done in list [OK]
Hint: Append 'done' + step for each plan item [OK]
Common Mistakes:
  • Confusing original steps with done steps
  • Thinking only last step is appended
  • Assuming append causes error
4. This code tries to implement plan-and-execute but has a bug:
plan = ['step1', 'step2']
for step in plan:
    execute(step)
    plan.remove(step)

What is the main problem?
medium
A. Modifying the plan list while looping causes skipping steps
B. The execute function is not defined
C. The loop should be a while loop
D. There is no problem; code works fine

Solution

  1. Step 1: Analyze loop and list modification

    The code removes items from the plan list while looping over it, which changes the list size and order during iteration.
  2. Step 2: Understand effect on iteration

    Removing items causes the loop to skip some steps because the list indices shift unexpectedly.
  3. Final Answer:

    Modifying the plan list while looping causes skipping steps -> Option A
  4. Quick Check:

    Changing list during loop skips items [OK]
Hint: Never change list while looping over it [OK]
Common Mistakes:
  • Thinking execute is missing
  • Believing while loop fixes skipping
  • Ignoring list modification effects
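
The skipping can be observed directly, and one common fix is to iterate over a snapshot of the list. In this sketch `execute` is a stand-in that just records the step it receives, and the variable names are assumptions for illustration.

```python
def execute(step, log):
    """Stand-in executor: records which steps were run."""
    log.append(step)


# Buggy version: the list shrinks while the for loop is walking it
plan = ['step1', 'step2']
buggy_log = []
for step in plan:
    execute(step, buggy_log)
    plan.remove(step)      # remaining items shift left, so 'step2' is never visited
leftover = list(plan)      # what is still in the plan after the loop

# Safer version: loop over a copy, so removals do not affect the iteration
plan = ['step1', 'step2']
fixed_log = []
for step in list(plan):
    execute(step, fixed_log)
    plan.remove(step)

print(buggy_log, leftover)
print(fixed_log, plan)
```

If the plan does not need to shrink as it runs, the simplest option is to not remove anything and just loop over it.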
5. You want an AI to plan and execute cleaning a house room by room. Which approach best uses the plan-and-execute pattern safely and clearly?
hard
A. Create a list of rooms, plan = ['kitchen', 'bathroom', 'bedroom'], then loop: for room in plan: clean(room)
B. Start cleaning randomly without a plan, hoping all rooms get cleaned
C. Plan all rooms but clean only the first one repeatedly
D. Clean the whole house at once without breaking into rooms

Solution

  1. Step 1: Identify safe planning method

    Breaking the big task (clean house) into smaller steps (clean each room) is safe and clear.
  2. Step 2: Match approach to plan-and-execute pattern

    Option A creates a plan list of rooms and then cleans each room in order, which matches the pattern well.
  3. Final Answer:

    Create a list of rooms, plan = ['kitchen', 'bathroom', 'bedroom'], then loop: for room in plan: clean(room) -> Option A
  4. Quick Check:

    Plan rooms, then clean each step [OK]
Hint: Plan rooms first, then clean one by one [OK]
Common Mistakes:
  • Skipping planning and acting randomly
  • Repeating same step only
  • Trying to do all at once without steps
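
Option A can be fleshed out into a slightly more defensive sketch that records the outcome of each room instead of stopping at the first error. The `clean` function and its simulated bathroom failure are hypothetical; they only exist to show how a per-step report might look.

```python
def clean(room):
    """Stand-in cleaning action; raises to simulate one failing step."""
    if room == "bathroom":
        raise RuntimeError("out of cleaning supplies")
    return f"{room} cleaned"


def clean_house(rooms):
    plan = tuple(rooms)    # fixed plan: a tuple cannot be modified during execution
    report = {}
    for room in plan:      # execute one room at a time, in order
        try:
            report[room] = clean(room)
        except RuntimeError as err:
            report[room] = f"failed: {err}"
    return report


report = clean_house(['kitchen', 'bathroom', 'bedroom'])
for room, outcome in report.items():
    print(f"{room}: {outcome}")
```

A per-room report like this also provides the raw outcomes needed to compute task success rate later.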