When we re-rank results, we want the best answers to come first. Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) both measure how high the correct or useful results appear in the list. This matters because users usually look only at the top few results.
Re-ranking retrieved results in Prompt Engineering / GenAI - Model Metrics & Evaluation
Re-ranking is about ordering, so confusion matrices are less common. Instead, we use ranking tables. For example, suppose we have 5 results and the relevant ones sit at positions 1, 3, and 5. The ranking would be better if all three relevant results occupied positions 1 to 3, as in the ideal row below.
Position:   1    2    3    4    5
Relevant?:  Yes  No   Yes  No   Yes
Ideal:      Yes  Yes  Yes  No   No
Metrics like NDCG give higher scores when relevant items are near the top.
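As a minimal sketch, NDCG for the table above can be computed by hand. This assumes binary relevance (Yes = 1, No = 0) and the common log2 position discount; the function names `dcg` and `ndcg` are illustrative, not taken from any library.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG of the given order divided by the DCG of the best possible order."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

actual = [1, 0, 1, 0, 1]  # relevant at positions 1, 3, 5 (assumed binary labels)
ideal = [1, 1, 1, 0, 0]   # all relevant results at the top

print(round(ndcg(actual), 3))  # 0.885
print(round(ndcg(ideal), 3))   # 1.0
```

The actual ranking scores below 1 only because two of its relevant results sit lower than they could.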
In re-ranking, precision at top k (precision@k) is the fraction of the top k results that are relevant. Recall is the fraction of all relevant results that appear in the returned list.
Example: If a search returns 10 results and 3 of them are relevant, precision@5 is the number of relevant results among the first 5, divided by 5. Recall is the number of relevant results returned, divided by the total number of relevant results that exist for the query.
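The example can be sketched in a few lines. The positions of the 3 relevant results (1, 4, and 8) and the total of 4 relevant documents for the query are assumptions made only for illustration, and the helper names are hypothetical.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the top k results that are relevant."""
    return sum(relevant_flags[:k]) / k

def recall_at_k(relevant_flags, k, total_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(relevant_flags[:k]) / total_relevant

# 10 returned results, 3 relevant (assumed to be at positions 1, 4, 8).
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
total_relevant = 4  # assumption: one relevant document was never retrieved

print(precision_at_k(flags, 5))               # 0.4
print(recall_at_k(flags, 5, total_relevant))  # 0.5
print(recall_at_k(flags, 10, total_relevant)) # 0.75
```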
Sometimes, showing fewer but very relevant results (high precision) is better, like in a shopping app. Other times, showing all relevant results (high recall) matters, like in legal document search.
Good: High MRR (close to 1), high NDCG (close to 1), and high precision@k (e.g., 0.8 or above) mean relevant results appear early.
Bad: Low MRR (near 0), low NDCG (near 0), and low precision@k (below 0.3) mean relevant results are buried deep or missing.
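To see where an MRR value comes from, here is a small sketch that averages the reciprocal rank of the first relevant result over several queries. The four query result lists are made-up data, and the function names are illustrative.

```python
def reciprocal_rank(relevant_flags):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for position, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of per-query relevance lists."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

# Hypothetical relevance flags for four queries.
queries = [
    [1, 0, 0],     # first relevant at position 1 -> 1.0
    [0, 1, 0],     # first relevant at position 2 -> 0.5
    [0, 0, 0, 1],  # first relevant at position 4 -> 0.25
    [0, 0, 0],     # nothing relevant             -> 0.0
]

print(mean_reciprocal_rank(queries))  # 0.4375
```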
- Ignoring user intent: Metrics may look good but results may not satisfy what users want.
- Overfitting to training queries: Model ranks well on known queries but fails on new ones.
- Data leakage: Using test data during training inflates metrics falsely.
- Using accuracy: Accuracy is not useful for ranking tasks because it ignores order.
Your re-ranking model has a precision@5 of 0.9 but an MRR of 0.4. Is it good? Why or why not?
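Answer: High precision@5 means most of the top 5 results are relevant, which is good. MRR depends only on the rank of the first relevant result for each query, so an MRR of 0.4 means that result is often not at position 1 (a reciprocal rank of 0.4 corresponds to a rank of about 2.5). Note that with a single set of binary relevance labels these two numbers should not occur together: if at least 4 of the top 5 results are relevant, the first relevant one is at position 1 or 2, so an average precision@5 of 0.9 forces MRR to be at least 0.75. A gap like this therefore suggests that MRR is being measured against a stricter target, such as the single best answer per query. Under that reading, the model is good at grouping relevant results near the top but not at ranking the best result first, and improvement is needed for a better user experience.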
Practice
What is the main purpose of re-ranking retrieved results in a search system?
Solution
Step 1: Understand the role of re-ranking
Re-ranking means sorting the results again after the initial search to improve their order.
Step 2: Identify the goal of re-ranking
The goal is to use a smarter scoring method to show the most relevant results at the top.
Final Answer:
To sort the initial search results again using a better scoring method -> Option A
Quick Check:
Re-ranking = better sorting [OK]
- Confusing re-ranking with removing duplicates
- Thinking re-ranking speeds up initial search
- Assuming re-ranking translates results
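The two-stage idea can be sketched as a cheap first pass that picks candidates, followed by a second scorer that reorders only those candidates. Both scoring functions here are toy stand-ins chosen for illustration (word overlap and Jaccard similarity); a real system would typically use a search index for the first stage and a learned model for the second.

```python
def first_stage_score(query, doc):
    """Cheap score: number of query words that occur in the document."""
    doc_words = set(doc.lower().split())
    return sum(1 for word in query.lower().split() if word in doc_words)

def rerank_score(query, doc):
    """Toy stand-in for a smarter scorer: Jaccard similarity of the word sets."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    union = q_words | d_words
    return len(q_words & d_words) / len(union) if union else 0.0

def retrieve_then_rerank(query, docs, top_n=3):
    # Stage 1: keep the top_n candidates by the cheap score.
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:top_n]
    # Stage 2: reorder only those candidates with the second scorer.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)

docs = [
    "python sorting tutorial for lists and dictionaries with many unrelated examples",
    "sorting lists in python",
    "cooking pasta at home",
    "python lists",
]

for doc in retrieve_then_rerank("sorting python lists", docs):
    print(doc)
```

In this toy data the long tutorial comes first after stage 1 but drops to last after stage 2, because the second scorer penalizes its many unrelated words.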
Which of the following code snippets correctly represents a simple re-ranking step that sorts a list of results by their score in descending order?
results = [{'id': 1, 'score': 0.5}, {'id': 2, 'score': 0.9}, {'id': 3, 'score': 0.7}]
# Re-rank results here
Solution
Step 1: Identify sorting by score descending
We want to sort by 'score' in descending order, so reverse=True is needed.
Step 2: Check each option
results.sort(key=lambda x: x['score'], reverse=True) sorts by 'score' with reverse=True, which is correct. The other options sort by 'id', sort by score in ascending order, or omit the key.
Final Answer:
results.sort(key=lambda x: x['score'], reverse=True) -> Option D
Quick Check:
Sort by score descending = results.sort(key=lambda x: x['score'], reverse=True) [OK]
- Forgetting reverse=True for descending sort
- Sorting by wrong key like 'id'
- Calling sort() without a key, which raises a TypeError for a list of dicts
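A short sketch of the last mistake, using the same `results` list as the question: sorting dicts without a key fails, while `sorted()` with a key returns a new, re-ranked list and leaves the original untouched. The variable names `raised` and `ranked` are illustrative.

```python
results = [{'id': 1, 'score': 0.5}, {'id': 2, 'score': 0.9}, {'id': 3, 'score': 0.7}]

# Without a key, Python tries to compare the dicts directly and raises TypeError.
try:
    sorted(results)
    raised = False
except TypeError:
    raised = True

# sorted() returns a new list; results.sort() would reorder the list in place.
ranked = sorted(results, key=lambda r: r['score'], reverse=True)

print(raised)                      # True
print([r['id'] for r in ranked])   # [2, 3, 1]
print([r['id'] for r in results])  # [1, 2, 3]
```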
Given the following code that re-ranks search results by a new score, what will be the output after re-ranking?
results = [
    {'id': 'a', 'score': 0.3},
    {'id': 'b', 'score': 0.8},
    {'id': 'c', 'score': 0.5}
]
# New scores from a re-ranker
new_scores = {'a': 0.9, 'b': 0.4, 'c': 0.7}
for r in results:
    r['score'] = new_scores[r['id']]
results.sort(key=lambda x: x['score'], reverse=True)
print([r['id'] for r in results])
Solution
Step 1: Update scores with new_scores
Results get scores: 'a' = 0.9, 'b' = 0.4, 'c' = 0.7.
Step 2: Sort results by updated score descending
Sorted order by score: 0.9 ('a'), 0.7 ('c'), 0.4 ('b').
Final Answer:
['a', 'c', 'b'] -> Option B
Quick Check:
Sort by new scores descending = ['a', 'c', 'b'] [OK]
- Sorting by old scores instead of new
- Sorting ascending instead of descending
- Mixing up ids and scores
Identify the error in this re-ranking code snippet and select the fix:
results = [{'id': 1, 'score': 0.2}, {'id': 2, 'score': 0.5}]
new_scores = {1: 0.7, 2: 0.9}
for r in results:
    r['score'] = new_scores[r['id']]
results.sort(key=lambda x: x['score'], reverse=True)
print(results)
Solution
Step 1: Check key types in new_scores and results
Both use integer keys for 'id', so the lookup works correctly.
Step 2: Verify sorting and printing
Sorting by the updated 'score' in descending order is valid and prints the sorted list.
Final Answer:
No error; code runs correctly and sorts results -> Option C
Quick Check:
Matching key types = no error [OK]
- Assuming string keys when they are integers
- Thinking sort() causes error without reason
- Adding unnecessary try-except blocks
You have a list of 5 retrieved documents with initial scores. You want to re-rank them using a machine learning model that outputs a relevance score. Which approach best improves the final ranking?
- Use the model scores to replace initial scores and sort descending.
- Combine initial and model scores by averaging, then sort descending.
- Sort only by initial scores, ignoring model scores.
- Randomly shuffle results to avoid bias.
Solution
Step 1: Understand re-ranking with model scores
Replacing the scores entirely may discard useful information from the initial retrieval; combining the scores balances both.
Step 2: Evaluate options for best ranking
Averaging the initial and model scores uses both signals, which can improve relevance and stability, provided the two scores are on comparable scales.
Final Answer:
Combine initial and model scores by averaging, then sort descending -> Option B
Quick Check:
Combine scores for best re-ranking [OK]
- Replacing scores blindly and losing the initial information
- Ignoring model scores completely
- Random shuffling breaks relevance
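One way to combine the two scores is sketched below. Because initial retrieval scores and model scores often live on different scales, this sketch min-max normalizes each list before averaging; the normalization step, the `weight` parameter, and all score values are assumptions added for illustration.

```python
def min_max_normalize(scores):
    """Rescale scores to the range [0, 1]; constant lists map to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine_and_rerank(docs, initial_scores, model_scores, weight=0.5):
    """Blend normalized scores (weight = share of the model score) and sort."""
    init_n = min_max_normalize(initial_scores)
    model_n = min_max_normalize(model_scores)
    combined = [(1 - weight) * i + weight * m for i, m in zip(init_n, model_n)]
    order = sorted(range(len(docs)), key=lambda idx: combined[idx], reverse=True)
    return [docs[idx] for idx in order]

docs = ['d1', 'd2', 'd3', 'd4', 'd5']
initial_scores = [12.0, 9.5, 8.0, 7.2, 3.1]    # hypothetical retrieval scores
model_scores = [0.20, 0.95, 0.60, 0.90, 0.10]  # hypothetical model relevance scores

print(combine_and_rerank(docs, initial_scores, model_scores))
# ['d2', 'd4', 'd3', 'd1', 'd5']
```

Setting `weight=1.0` reduces this to replacing the initial scores with the model scores, and `weight=0.0` ignores the model, so the same function covers the first three options in the question.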
