
Long-term memory with vector stores in Agentic AI - Model Metrics & Evaluation

Which metric matters for Long-term memory with vector stores and WHY

When using vector stores for long-term memory, the key metric is recall: the fraction of relevant stored memories that are actually retrieved when a query comes in. Missing important memories means the system effectively forgets useful knowledge.

Another important metric is precision: the fraction of retrieved memories that are actually relevant. High precision means fewer distractions from unrelated memories.

We also track the F1 score, the harmonic mean of precision and recall, to ensure memory retrieval is both complete and accurate.

For ranking results, Mean Average Precision (MAP) or Normalized Discounted Cumulative Gain (NDCG) can measure how well the most relevant memories appear at the top.
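As a concrete illustration, here is a minimal sketch of NDCG for a single query over binary relevance labels (the `ranked` list below is hypothetical):

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: rewards relevant items near the top.
    Position i is discounted by log2(i + 2), so rank 1 gets weight 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (best-first) ordering, giving a 0..1 score."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical retrieval order: 1 = relevant memory, 0 = irrelevant
ranked = [1, 0, 1, 1, 0]
print(round(ndcg(ranked), 3))
# 0.906 -- close to 1 because most relevant memories appear near the top
```

A perfectly ordered list (all relevant items first) scores exactly 1.0, which is what makes NDCG easy to compare across queries.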

Confusion matrix for memory retrieval
                  |      Relevant      |     Irrelevant
    --------------|--------------------|--------------------
    Retrieved     | True Positive (TP) | False Positive (FP)
    Not retrieved | False Negative (FN)| True Negative (TN)
    

Example: Suppose the system retrieves 8 relevant memories (TP), 2 irrelevant ones (FP), misses 3 relevant memories (FN), and correctly ignores 7 irrelevant memories (TN).

Totals: TP=8, FP=2, FN=3, TN=7, Total=20
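Plugging these counts into the standard formulas, as a minimal Python sketch:

```python
tp, fp, fn, tn = 8, 2, 3, 7

precision = tp / (tp + fp)                  # of retrieved memories, how many were relevant
recall = tp / (tp + fn)                     # of relevant memories, how many were retrieved
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall fraction classified correctly

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.80 recall=0.73 f1=0.76 accuracy=0.75
```

So this system keeps noise low (precision 0.80) but still forgets about a quarter of the relevant memories (recall 0.73).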

Precision vs Recall tradeoff with examples

If the system retrieves many memories to avoid missing any (high recall), it may include irrelevant ones (low precision). This can confuse the AI with too much noise.

If it retrieves only very confident memories (high precision), it might miss some useful ones (low recall), causing the AI to forget important facts.

For example, a customer support AI using long-term memory should have high recall to remember all past issues, but also good precision to avoid irrelevant past cases.
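One way to see the tradeoff is to sweep the similarity threshold used to accept retrieved memories. The scores and relevance labels below are invented for illustration:

```python
# Hypothetical (similarity score, is_relevant) pairs for one query
scored = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
          (0.70, False), (0.65, True), (0.55, False), (0.40, False)]
total_relevant = sum(rel for _, rel in scored)

for threshold in (0.9, 0.75, 0.5):
    # Accept only memories whose similarity meets the threshold
    retrieved = [rel for score, rel in scored if score >= threshold]
    tp = sum(retrieved)
    precision = tp / len(retrieved)
    recall = tp / total_relevant
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
# threshold=0.9: precision=1.00 recall=0.50
# threshold=0.75: precision=0.75 recall=0.75
# threshold=0.5: precision=0.57 recall=1.00
```

Raising the threshold buys precision at the cost of recall, and vice versa; the right operating point depends on whether missed memories or noisy memories hurt the agent more.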

What good vs bad metric values look like

Good: Recall and precision both above 0.8, meaning the system finds most relevant memories while keeping irrelevant retrievals low.

Bad: Recall below 0.5 means many relevant memories are missed, degrading the AI's knowledge. Precision below 0.5 means many irrelevant memories confuse the AI.

An F1 score below 0.6 suggests a poor balance and calls for tuning vector search parameters (such as top-k or the similarity threshold) or improving the embeddings.

Common pitfalls in metrics for vector store memory
  • Accuracy paradox: High accuracy can be misleading if most memories are irrelevant and the system just returns few results.
  • Data leakage: If test queries are too similar to stored vectors, metrics look better than real use.
  • Overfitting: Tuning vector search too tightly on test data can reduce generalization to new queries.
  • Ignoring ranking metrics: Only counting retrieved vs missed memories misses how well top results are ordered.
Self-check question

Your long-term memory system has 98% accuracy but only 12% recall on relevant memories. Is it good for production?

Answer: No. The high accuracy is misleading because most memories are irrelevant. The very low recall means the system misses almost all relevant memories, so it forgets important information. This hurts AI performance and user experience.
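One hypothetical set of counts consistent with those numbers shows why accuracy hides the problem:

```python
# Hypothetical store: 50 relevant memories among 2,700 candidates for a query
tp, fn, fp, tn = 6, 44, 10, 2640

accuracy = (tp + tn) / (tp + fn + fp + tn)  # dominated by the many true negatives
recall = tp / (tp + fn)                     # 44 of the 50 relevant memories are missed

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# accuracy=0.98 recall=0.12
```

Correctly ignoring thousands of irrelevant memories inflates accuracy even though the system retrieves almost nothing useful, which is exactly the accuracy paradox from the pitfalls above.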

Key Result
Recall is most important to ensure relevant memories are found; balance with precision to avoid noise.