In Retrieval-Augmented Generation (RAG) models, which metric best measures how well the retrieval component finds relevant documents?
Think about which metric evaluates the retrieval step's ability to find relevant documents.
Recall@k is used to evaluate retrieval quality by checking if relevant documents appear in the top k retrieved results. BLEU and F1 focus on generation quality, while perplexity measures language model uncertainty.
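As a concrete illustration, Recall@k can be computed in a few lines (a minimal sketch; the document IDs below are hypothetical):

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Hypothetical example: 2 of the 3 relevant docs appear in the top 5.
print(round(recall_at_k({1, 2, 3}, [2, 9, 1, 8, 7], 5), 2))  # -> 0.67
```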
Which metric is most appropriate to evaluate the quality of the generated answers in a RAG system compared to reference answers?
Focus on metrics that compare generated text to reference text.
ROUGE-L measures the quality of generated text by comparing longest common subsequence overlap with reference answers, making it suitable for generation evaluation. Recall@k and Top-1 accuracy relate to retrieval, and MSE on embeddings is less common for direct text quality.
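To make the longest-common-subsequence idea concrete, here is a minimal from-scratch sketch of ROUGE-L F1 over whitespace tokens (real evaluations typically use a library such as `rouge-score`, which also handles tokenization and stemming):

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall.
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Hypothetical candidate/reference pair: LCS is "the cat on the mat" (5 tokens).
print(round(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"), 2))  # -> 0.83
```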
What is the output of this Python code that calculates Recall@3 for a RAG retrieval step?
relevant_docs = {101, 102, 103}
retrieved_docs = [104, 102, 105, 101, 106]
recall_at_3 = len(set(retrieved_docs[:3]) & relevant_docs) / len(relevant_docs)
print(round(recall_at_3, 2))
Check which relevant documents appear in the first 3 retrieved docs.
The first 3 retrieved docs are [104, 102, 105]. Only 102 is relevant. So recall@3 = 1/3 ≈ 0.33.
You want to evaluate a RAG model end-to-end, considering both retrieval and generation quality. Which combined metric approach is best?
Think about combining retrieval and generation metrics for full evaluation.
Evaluating RAG end-to-end requires assessing both retrieval (Recall@k) and generation (ROUGE-L). Averaging these gives a balanced view. Using only one ignores part of the pipeline.
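The averaging idea can be sketched as a weighted combination of the two per-query scores (the weight and the example scores below are hypothetical):

```python
def combined_rag_score(recall_score, rouge_score, retrieval_weight=0.5):
    # Weighted average of a retrieval metric (e.g. Recall@k) and a
    # generation metric (e.g. ROUGE-L); equal weights by default.
    return retrieval_weight * recall_score + (1 - retrieval_weight) * rouge_score

# Hypothetical per-query scores: Recall@k = 0.67, ROUGE-L = 0.83.
print(round(combined_rag_score(0.67, 0.83), 2))  # -> 0.75
```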
Given this code snippet for Recall@5, what is the bug causing incorrect recall calculation?
relevant = [10, 20, 30]
retrieved = [20, 10, 40, 50, 60]
recall = len(set(retrieved[:5]) & set(relevant)) / len(retrieved[:5])
print(round(recall, 2))
Recall measures fraction of relevant items retrieved, so denominator matters.
Recall is defined as (relevant documents retrieved) / (total relevant documents). The code incorrectly divides by the number of retrieved documents, which computes Precision@5 instead: it prints 0.4 (2/5) rather than the correct Recall@5 of 0.67 (2/3).
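A corrected version of the snippet divides by the number of relevant documents:

```python
relevant = [10, 20, 30]
retrieved = [20, 10, 40, 50, 60]

# Correct Recall@5: divide by the total number of relevant documents.
recall = len(set(retrieved[:5]) & set(relevant)) / len(relevant)
print(round(recall, 2))  # -> 0.67
```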