For agents that use memory, task success rate and long-term consistency are the key metrics. Memory lets an agent recall past actions and information, so it can make better decisions over time. Measuring how often the agent completes tasks correctly (success rate) and how consistently it behaves across steps (consistency) shows whether memory is actually helping.
Why memory makes agents useful in Agentic AI - Why Metrics Matter
Task Completion Confusion Matrix:
| | Predicted Success | Predicted Failure |
|----------------|-------------------|-------------------|
| Actual Success | 85 (TP) | 15 (FN) |
| Actual Failure | 10 (FP) | 90 (TN) |
Total tasks = 200
Precision = TP / (TP + FP) = 85 / (85 + 10) ≈ 0.895
Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.872
This matrix shows how well the agent with memory predicts task success. High precision means it rarely says success when it fails. High recall means it catches most successes.
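The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `precision_recall_f1` is made up for illustration, and the counts are taken from the confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Guard against division by zero when a row or column is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the task completion confusion matrix.
tp, fn, fp, tn = 85, 15, 10, 90

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Values are rounded to three decimals when printed, so small differences from hand-truncated figures are expected.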
Imagine an agent helping a user book flights. If it has high precision, it rarely suggests wrong flights (few false positives), so the user trusts it. But if it has low recall, it might miss some good flight options.
If it has high recall, it finds almost all good flights, but with low precision, it might suggest many bad options, annoying the user.
Memory helps balance this by remembering past preferences and avoiding repeated mistakes, improving both precision and recall over time.
Good metrics: Task success rate above 85%, precision and recall both above 80%, and consistent behavior across sessions.
Bad metrics: Success rate below 60%, precision or recall below 50%, and erratic or contradictory actions showing poor memory use.
Good memory use means the agent learns from past steps and improves. Bad memory use means it forgets or repeats errors.
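One possible way to put numbers on success rate and cross-session consistency is sketched below. The log format, the function names, and the definition of consistency (a repeated prompt must always receive the same answer) are assumptions made for illustration, not a standard metric.

```python
from collections import defaultdict


def success_rate(outcomes):
    """Fraction of tasks completed correctly; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def consistency_rate(log):
    """Fraction of repeated prompts that got the same answer in every session.

    log is a list of (session_id, prompt, answer) tuples (hypothetical format).
    Prompts that appear in only one session are ignored.
    """
    answers = defaultdict(set)
    sessions = defaultdict(set)
    for session_id, prompt, answer in log:
        answers[prompt].add(answer)
        sessions[prompt].add(session_id)

    repeated = [p for p in sessions if len(sessions[p]) > 1]
    if not repeated:
        return 1.0  # nothing to contradict
    consistent = sum(1 for p in repeated if len(answers[p]) == 1)
    return consistent / len(repeated)


log = [
    (1, "preferred seat?", "aisle"),
    (2, "preferred seat?", "aisle"),
    (1, "home airport?", "SFO"),
    (2, "home airport?", "LAX"),        # contradicts session 1
    (2, "meal choice?", "vegetarian"),  # seen in one session only, ignored
]
outcomes = [True, True, False, True]

print(success_rate(outcomes))   # 0.75
print(consistency_rate(log))    # 0.5
```

Exact string matching is the simplest comparison; a real evaluation would likely need a looser notion of "same answer".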
Accuracy paradox: An agent might have high overall accuracy by guessing common outcomes but fail on important rare tasks.
Data leakage: If the agent's memory accidentally includes future information, metrics look better but don't reflect real use.
Overfitting: The agent might memorize specific past tasks perfectly but fail to generalize to new ones, showing high training success but low real-world performance.
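The data leakage pitfall can be guarded against by filtering memory on time before each evaluated task. The sketch below assumes every memory entry carries a timestamp; `MemoryEntry` and `visible_memory` are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    timestamp: int  # assumed: step or time at which the entry was written
    content: str


def visible_memory(memory, task_time):
    """Return only the entries written strictly before the task started."""
    return [m for m in memory if m.timestamp < task_time]


memory = [
    MemoryEntry(1, "user prefers morning flights"),
    MemoryEntry(5, "user booked a flight to Paris"),
    MemoryEntry(9, "outcome of the task evaluated at t=7"),  # future information
]

# Evaluate a task that starts at t=7: the entry written at t=9 must be hidden.
allowed = visible_memory(memory, task_time=7)
print([m.content for m in allowed])
```

If metrics drop noticeably once this filter is applied, the earlier scores were probably inflated by leaked information.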
Your agent has 98% accuracy but only 12% recall on important tasks. Is it good for production? Why not?
Answer: No, it is not good. The low recall means the agent misses most important tasks, even if overall accuracy is high. This means it often fails when it matters most, so memory or decision-making needs improvement.
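To see how 98% accuracy and 12% recall can coexist, consider the sketch below. The counts are hypothetical, chosen only to reproduce those two figures, and "important tasks" are treated as the positive class.

```python
# Hypothetical counts: 10,000 tasks, of which only 200 are important.
tp, fn = 24, 176    # important tasks: 24 handled correctly, 176 missed
fp, tn = 24, 9776   # routine tasks: 24 wrongly flagged, 9,776 handled correctly

total = tp + fn + fp + tn
accuracy = (tp + tn) / total   # dominated by the 9,776 routine tasks
recall = tp / (tp + fn)        # share of important tasks that were caught

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.98 recall=0.12
```

Because routine tasks make up 98% of the workload, getting them right is enough for a high accuracy score even though 176 of the 200 important tasks are missed.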
Practice
Solution
Step 1: Understand the role of memory in agents
Memory stores past information that the agent can use later.
Step 2: Connect memory to decision-making
Remembering past events helps the agent make smarter choices.
Final Answer:
It helps the agent remember past information to make better decisions. -> Option B
Quick Check:
Memory improves decisions [OK]
- Thinking memory speeds up code execution
- Confusing memory with interface design
- Assuming memory reduces code size
Solution
Step 1: Define agent memory
Memory is where the agent keeps past experiences or information.
Step 2: Eliminate incorrect options
Deleting data or forgetting instantly is the opposite of memory's purpose.
Final Answer:
A place where the agent stores past experiences. -> Option A
Quick Check:
Memory stores past info [OK]
- Confusing memory with forgetting
- Thinking memory only stores names
- Believing memory deletes data after each step
memory = []
for event in ['rain', 'sun', 'rain']:
    memory.append(event)
print(memory.count('rain'))

What will be the output?
Solution
Step 1: Understand the loop and memory updates
The loop adds 'rain', 'sun', and 'rain' to the memory list.
Step 2: Count how many times 'rain' appears
'rain' appears twice in the list, so memory.count('rain') returns 2.
Final Answer:
2 -> Option D
Quick Check:
Count of 'rain' = 2 [OK]
- Counting only once instead of twice
- Confusing list length with count
- Assuming count returns total list size
memory = []
events = ['rain', 'sun', 'rain']
for event in events:
    if event not in memory:
        memory.append(event)
print(memory)

What is the output?
Solution
Step 1: Check how memory stores unique events
The code adds 'rain' first, then 'sun', and skips the second 'rain' because it is already in memory.
Step 2: Review the final memory list
Memory contains ['rain', 'sun'] after the loop finishes.
Final Answer:
['rain', 'sun'] -> Option A
Quick Check:
Memory stores unique events [OK]
- Assuming all events are added including duplicates
- Mixing order of events in memory
- Forgetting the 'if' condition effect
memory = {}
inputs = [('color', 'blue'), ('food', 'pizza'), ('color', 'green')]
for key, value in inputs:
    memory[key] = value
print(memory)

What is the final content of memory and why does this show memory's usefulness?
Solution
Step 1: Analyze how dictionary memory updates
Each key in the dictionary holds only its latest value, so 'color' changes from 'blue' to 'green'.
Step 2: Understand why this helps personalization
Memory keeps the latest user preferences, so the agent can respond based on current info.
Final Answer:
{'color': 'green', 'food': 'pizza'} because memory updates preferences, enabling personalization. -> Option C
Quick Check:
Memory updates preferences [OK]
- Thinking dictionary stores duplicate keys
- Assuming memory clears after each input
- Ignoring key update behavior in dictionaries
