Memory for conversation history in Prompt Engineering / GenAI - Model Metrics & Evaluation

When a model retains past conversation, we want to check how well it keeps important details without mixing them up or forgetting them. Key metrics include recall, which measures whether the model remembers all relevant past information, and precision, which measures whether it avoids adding wrong or unrelated information. The F1 score balances the two. Together, these metrics tell us whether the memory is accurate and complete, which is vital for smooth, meaningful conversations.
| Actual \ Predicted | Remembered | Not remembered |
|--------------------|------------|----------------|
| Relevant           | TP         | FN             |
| Irrelevant         | FP         | TN             |
- TP = important info correctly remembered
- FP = incorrect or unrelated info remembered
- FN = important info forgotten
- TN = irrelevant info correctly ignored
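The three metrics follow directly from these counts. A minimal sketch (the counts in the example are illustrative, not from the text):

```python
def memory_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    precision = TP / (TP + FP): how much of what was remembered is right.
    recall    = TP / (TP + FN): how much of what mattered was remembered.
    F1 is the harmonic mean of the two.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 8 facts remembered correctly, 2 wrong additions, 2 forgotten.
p, r, f1 = memory_metrics(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Note that TN does not appear in precision, recall, or F1; that is exactly why these metrics are more informative than accuracy when irrelevant information dominates.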
High Precision, Low Recall: The model keeps only facts it is very sure of, avoiding mistakes but forgetting some details. Good when wrong info is harmful.
High Recall, Low Precision: The model tries to keep everything, including some wrong or irrelevant details. Good when missing info is worse than a few mistakes.
For example, in a customer support chat, high recall helps remember all user issues, but high precision avoids confusing the user with wrong info.
Good: Precision and recall both above 0.8 mean the model remembers most important info and rarely adds wrong details.
Bad: Precision below 0.5 means many wrong memories; recall below 0.5 means many forgotten details. Either harms conversation quality.
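These rules of thumb can be expressed as a simple check. A sketch, assuming the 0.8 and 0.5 cutoffs above (they are illustrative thresholds, not a standard):

```python
def memory_quality(precision, recall,
                   good_threshold=0.8, bad_threshold=0.5):
    """Classify memory quality with the rule-of-thumb cutoffs from the text.

    The threshold values are assumptions for illustration; pick cutoffs
    that match the cost of wrong vs. forgotten info in your application.
    """
    if precision >= good_threshold and recall >= good_threshold:
        return "good"
    if precision < bad_threshold or recall < bad_threshold:
        return "bad"
    return "borderline"

print(memory_quality(0.85, 0.90))  # both above 0.8
print(memory_quality(0.40, 0.90))  # precision below 0.5
```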
- Accuracy paradox: High overall accuracy can hide poor memory if irrelevant info dominates.
- Data leakage: If test conversations overlap with training data, metrics look better than the model's real memory ability.
- Overfitting: Model may memorize training chats perfectly but fail on new conversations.
Your chat model has 98% accuracy remembering conversation history but only 12% recall on important past details. Is it good for real use? Why or why not?
Answer: No, it is not good. The low recall means the model forgets most important info, even if overall accuracy looks high. This will cause poor chat quality because key details are missed.
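The accuracy paradox in this scenario can be made concrete with counts. A sketch with assumed numbers (the question gives only the two percentages; the confusion-matrix counts below are chosen to reproduce them):

```python
# Assumed counts: important details are rare (200 of 10,000 items),
# so true negatives dominate accuracy even though most important
# details are forgotten.
tp, fp, fn, tn = 24, 24, 176, 9776
total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # dominated by the 9,776 true negatives
recall = tp / (tp + fn)        # only 24 of 200 important details kept
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

With these counts the model scores 98% accuracy while forgetting 176 of the 200 details that actually matter, which is exactly why accuracy alone is misleading here.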