
Re-ranking retrieved results in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Re-ranking retrieved results
Which metric matters for re-ranking retrieved results and WHY

When we re-rank results, we want the best answers to come first. Metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) matter most here: they measure how near the top of the list the correct or useful results appear. This matters because users usually look only at the top few results.
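
A minimal sketch of how reciprocal rank and MRR can be computed, using made-up relevance labels for three queries (illustration data only, not from any real evaluation set):

def reciprocal_rank(relevance):
    # 1 / position of the first relevant result, or 0 if nothing is relevant.
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0

rankings = [
    [1, 0, 0, 0, 0],   # first relevant result at position 1 -> RR = 1.0
    [0, 0, 1, 0, 0],   # first relevant result at position 3 -> RR = 1/3
    [0, 1, 0, 0, 1],   # first relevant result at position 2 -> RR = 1/2
]

mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
print(f"MRR = {mrr:.3f}")   # (1.0 + 1/3 + 1/2) / 3 ≈ 0.611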

Confusion matrix or equivalent visualization

Re-ranking is about ordering, so confusion matrices are less useful here. Instead, we can lay the ranking out as a table. For example, if we have 5 results and the relevant ones sit at positions 1, 3, and 5, we can compare that ordering against the ideal ordering, where all relevant results come first:

Position:    1    2    3    4    5
Relevant?:   Yes  No   Yes  No   Yes
Ideal:       Yes  Yes  Yes  No   No

Metrics like NDCG give higher scores when relevant items are near the top.
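
As a rough sketch, here is how NDCG could be computed for the ordering in the table above, with the Yes/No labels converted to 1/0 (an illustration, not a library implementation):

import math

def dcg(relevance):
    # Discounted cumulative gain: each gain is discounted by log2(position + 1).
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevance, start=1))

actual = [1, 0, 1, 0, 1]   # Relevant? row from the table above
ideal  = [1, 1, 1, 0, 0]   # Ideal row: all relevant items moved to the top

ndcg = dcg(actual) / dcg(ideal)
print(f"NDCG = {ndcg:.3f}")   # about 0.89: good, but not perfect, ordering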

Precision vs Recall tradeoff with concrete examples

In re-ranking, precision@k is the fraction of the top k results that are relevant. Recall@k is the fraction of all relevant results that appear in the top k.

Example: A search returns 10 results, 3 of which are relevant. If 2 of those relevant results appear in the first 5, precision@5 = 2/5 = 0.4 and recall@5 = 2/3 ≈ 0.67.
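
A minimal sketch of precision@k and recall@k for this scenario; the positions of the relevant results (2, 4, and 8) are assumed for illustration:

def precision_at_k(relevance, k):
    # Fraction of the top-k results that are relevant.
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    # Fraction of all relevant results that appear in the top k.
    return sum(relevance[:k]) / total_relevant

# Hypothetical query: 10 results, the 3 relevant ones at positions 2, 4, and 8.
relevance = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]

print(precision_at_k(relevance, 5))                  # 2/5 = 0.4
print(recall_at_k(relevance, 5, total_relevant=3))   # 2/3 ≈ 0.67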

Sometimes, showing fewer but very relevant results (high precision) is better, like in a shopping app. Other times, showing all relevant results (high recall) matters, like in legal document search.

What "good" vs "bad" metric values look like for re-ranking

Good: High MRR (close to 1), high NDCG (close to 1), and high precision@k (e.g., 0.8 or above) mean relevant results appear early.

Bad: Low MRR (near 0), low NDCG (near 0), and low precision@k (below 0.3) mean relevant results are buried deep or missing.

Metrics pitfalls
  • Ignoring user intent: Metrics may look good but results may not satisfy what users want.
  • Overfitting to training queries: Model ranks well on known queries but fails on new ones.
  • Data leakage: Using test data during training inflates metrics falsely.
  • Using accuracy: Accuracy is not useful for ranking tasks because it ignores order.
Self-check question

Your re-ranking model has a precision@5 of 0.9 but an MRR of 0.4. Is it good? Why or why not?

Answer: High precision@5 means most of the top 5 results are relevant, which is good. But a low MRR means the first relevant result is often not at the very top of the list, so users may not see the best answer immediately. In other words, the model is good at grouping relevant results near the top but not at putting the single best result first, and that hurts the user experience.
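
To make the divergence concrete, here is a made-up single query that shows the same pattern (not the exact numbers from the question): the top 5 are mostly relevant, yet the first relevant result is not in position 1.

def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k

def reciprocal_rank(relevance):
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0

# Position 1 is a miss, but positions 2-5 are all relevant.
relevance = [0, 1, 1, 1, 1]

print(precision_at_k(relevance, 5))   # 0.8 -> the top 5 look strong
print(reciprocal_rank(relevance))     # 0.5 -> the best answer is not shown first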

Key Result
For re-ranking, metrics like MRR and NDCG best show if relevant results appear early, improving user satisfaction.