Memory persistence and storage in Agentic AI - Model Metrics & Evaluation

For memory persistence and storage in AI agents, the key metrics are data retrieval accuracy and latency. Data retrieval accuracy measures how correctly stored information is recalled when needed; latency measures how quickly the memory system responds. Both matter because an AI agent must remember past information correctly and quickly to act effectively over time.
While traditional confusion matrices apply to classification, for memory persistence we can think of a retrieval confusion matrix:
                      | Retrieved            | Not Retrieved
----------------------+----------------------+----------------------
Relevant info stored  | True Positive (TP)   | False Negative (FN)
No relevant info      | False Positive (FP)  | True Negative (TN)
Here, TP means the memory returned the correct stored info; FP means it returned wrong or irrelevant info; FN means it failed to retrieve info that was stored; and TN means it correctly returned nothing when there was nothing to retrieve.
In memory systems:
- Precision: when the system retrieves info, how often that info is correct. High precision avoids recalling wrong memories.
- Recall: how much of the relevant stored info the system actually finds. High recall avoids forgetting.
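As a minimal sketch, both metrics can be computed directly from the retrieval counts in the table above (the counts here are hypothetical):

```python
def precision(tp: int, fp: int) -> float:
    # Of everything the memory system retrieved, what fraction was correct?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all relevant stored info, what fraction was actually found?
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical retrieval counts for a memory system
tp, fp, fn = 80, 10, 20
print(precision(tp, fp))  # 80 / 90 ≈ 0.889
print(recall(tp, fn))     # 80 / 100 = 0.8
```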
Example: A personal assistant AI that remembers your preferences.
- If precision is low, it might recall wrong preferences, causing bad suggestions.
- If recall is low, it might forget some preferences, missing chances to help.
Balancing precision and recall is key: the agent should remember correctly (precision) while also not forgetting important info (recall).
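One standard way to express this balance in a single number is the F1 score, the harmonic mean of precision and recall; a sketch with hypothetical values:

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean: heavily penalizes an imbalance between the two metrics
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A balanced memory (P=0.9, R=0.9) scores higher than a lopsided
# one (P=0.99, R=0.5) that rarely errs but forgets half of what it stored.
print(f1_score(0.9, 0.9))    # ≈ 0.9
print(f1_score(0.99, 0.5))   # ≈ 0.664
```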
Good values:
- Precision and recall both above 90% means memory is reliable and complete.
- Low latency (e.g., under 100 ms) means fast access to stored info.
Bad values:
- Precision below 70% means many wrong memories retrieved, confusing the agent.
- Recall below 70% means important info is often forgotten.
- High latency (e.g., over 1 second) makes the agent slow and less responsive.
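Latency is easy to measure empirically. A minimal sketch, timing lookups against a plain dict standing in for a real memory backend (store contents and queries are hypothetical):

```python
import time
import statistics

def measure_latency(lookup, queries):
    """Time each memory lookup; return (median, worst-case) latency in ms."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        lookup(q)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples), max(samples)

# Hypothetical in-memory store standing in for a real memory backend
store = {"favorite_color": "blue", "timezone": "UTC"}
median_ms, worst_ms = measure_latency(
    store.get, ["favorite_color", "timezone", "missing_key"]
)
print(f"median={median_ms:.3f} ms, worst={worst_ms:.3f} ms")
```

In production the worst case (e.g., the 99th percentile) usually matters more than the median, since a single slow lookup stalls the whole agent step.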
Common pitfalls:
- Accuracy paradox: High overall accuracy can hide poor recall if most queries have no stored info.
- Data leakage: If memory stores test data accidentally, metrics will be unrealistically high.
- Overfitting: Memory tuned too tightly to training data may fail to generalize to new info.
- Ignoring latency: Good accuracy but slow retrieval harms real-time agent use.
Question: Your AI agent's memory system shows 98% accuracy but only 12% recall on important stored info. Is it good for production? Why or why not?
Answer: No. The very low recall means the system forgets most important stored info, even though it rarely returns wrong info. That undermines the agent's ability to act on past knowledge, making it unreliable despite the high accuracy.
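To see how 98% accuracy and 12% recall can coexist, here is a minimal sketch with hypothetical counts chosen to reproduce those exact numbers, illustrating the accuracy paradox noted above:

```python
# Hypothetical counts: 4400 queries, only 100 of which have relevant
# stored info (heavy class imbalance toward "nothing to retrieve").
tp, fn = 12, 88      # only 12 of the 100 relevant items are retrieved
fp, tn = 0, 4300     # the many empty queries are answered correctly

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}")  # 0.98 — looks great on paper
print(f"recall={recall:.2f}")      # 0.12 — forgets 88% of relevant info
```

The easy "nothing stored" queries dominate the accuracy figure, masking the fact that nearly all of the info that actually matters is never recalled.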