RAG (Retrieval-Augmented Generation) models combine retrieval of relevant documents with answer generation. To evaluate how well they work, we need metrics that cover each stage as well as the system as a whole:
- Retrieval quality: How good is the search? We use Recall@k to check whether the relevant documents appear in the top k retrieved results.
- Generation quality: How good is the answer text itself? N-gram overlap metrics such as BLEU, ROUGE, or METEOR compare the generated answer against a reference answer.
- End-to-end quality: How useful is the final answer? Exact Match (EM) and token-level F1 on the answer text measure correctness and partial overlap with the reference.
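The retrieval metric above can be sketched in a few lines. This is a minimal illustration, not a library API: the document IDs and the `recall_at_k` helper are hypothetical, and it assumes the retriever returns a ranked list of IDs per query.

```python
from typing import List, Set

def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# Hypothetical ranked output of a retriever for one query.
retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d4"}            # gold labels for the query
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 — one of two relevant docs in top 3
```

In practice this is averaged over all queries in the evaluation set, and often reported for several values of k (e.g. k = 1, 5, 10).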
Together, these metrics show whether the model retrieves the right information and uses it effectively to produce the answer.
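The end-to-end metrics are commonly computed SQuAD-style: answers are normalized (lowercased, punctuation and articles stripped) before comparison. A minimal sketch, assuming short extractive answers:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1.0 after normalization
print(round(f1_score("tower in Paris", "eiffel tower"), 3))  # 0.4 — partial credit
```

EM is strict and rewards only exact answers; F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.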