
Sentence transformers in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which metric matters for Sentence Transformers and WHY

Sentence transformers create vector representations (embeddings) of sentences. We want these vectors to capture meaning, so we measure how well the model places similar sentences close together and dissimilar sentences far apart.

Common metrics include cosine similarity, which measures how close two vectors are, and retrieval metrics such as Recall@K and Mean Reciprocal Rank (MRR). These show whether the model surfaces the right similar sentences.
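To make these metrics concrete, here is a minimal sketch in plain Python (no model involved; the vectors and document ids are toy data invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant items that appear in the top-k retrieved list.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(ranked_lists, relevant_sets):
    # Mean of 1/rank of the first relevant item per query (0 if none found).
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))          # identical vectors -> 1.0
print(recall_at_k(["d1", "d3", "d2"], ["d2", "d4"], k=3)) # 1 of 2 relevant found -> 0.5
print(mrr([["d1", "d2"], ["d2", "d1"]], [{"d2"}, {"d2"}]))# (1/2 + 1) / 2 = 0.75
```

In practice you would compute these over embeddings produced by the model, but the formulas themselves are this simple.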

Confusion matrix or equivalent visualization

For sentence transformers, we often use retrieval evaluation instead of a confusion matrix. Here is a simple example for a retrieval task with 5 queries:

Query | Relevant Sentences Retrieved | Total Retrieved
------|------------------------------|----------------
  1   | 3 (TP)                       | 5
  2   | 2 (TP)                       | 4
  3   | 4 (TP)                       | 5
  4   | 1 (TP)                       | 3
  5   | 5 (TP)                       | 5

We count true positives (TP) as relevant sentences that were retrieved. False positives (FP) are retrieved but irrelevant; false negatives (FN) are relevant but not retrieved.
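From the per-query counts in the table above, per-query precision (TP divided by total retrieved) and its macro average can be computed directly:

```python
# Per-query counts from the table: (true positives, total retrieved).
per_query = [(3, 5), (2, 4), (4, 5), (1, 3), (5, 5)]

# Precision for each query = TP / total retrieved.
precisions = [tp / total for tp, total in per_query]
macro_precision = sum(precisions) / len(precisions)

print([round(p, 2) for p in precisions])  # [0.6, 0.5, 0.8, 0.33, 1.0]
print(round(macro_precision, 2))          # 0.65
```

Computing recall the same way would additionally require the total number of relevant sentences per query, which the table does not list.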

Precision vs Recall tradeoff with concrete examples

Precision is the fraction of retrieved sentences that are actually relevant. Recall is the fraction of all relevant sentences that were retrieved.

Example: If you want to find similar customer reviews, high recall means you find most similar reviews, even if some are less relevant. High precision means most found reviews are very similar, but you might miss some.

High recall matters when missing a relevant sentence is costly, as in legal document search. High precision matters when each returned match must be accurate, as in question answering.

What "good" vs "bad" metric values look like for Sentence Transformers

Good: Recall@10 above 0.8 means the model finds at least 80% of relevant sentences within the top 10 results. Cosine similarity close to 1 for genuinely similar sentence pairs indicates good embeddings.

Bad: Recall@10 below 0.3 means the model misses many relevant sentences. Low precision means many irrelevant sentences appear in results. Cosine similarity near 0 or negative for similar sentences means poor embeddings.

Common pitfalls in metrics for Sentence Transformers
  • Ignoring dataset balance: If most sentences are unrelated, accuracy can be misleadingly high.
  • Overfitting: Model performs well on training pairs but poorly on new sentences.
  • Data leakage: Using test sentences in training can inflate metrics.
  • Using only accuracy: Accuracy is not meaningful for retrieval tasks; use recall and precision instead.
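The first and last pitfalls are easy to demonstrate with a toy imbalanced dataset (labels invented for illustration):

```python
# Imbalanced toy data: 98 irrelevant pairs, 2 relevant pairs.
labels = [0] * 98 + [1] * 2
# A useless model that predicts "irrelevant" for every pair.
predictions = [0] * 100

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = tp / sum(labels)

print(accuracy)  # 0.98 -- looks great
print(recall)    # 0.0  -- the model finds nothing relevant
```

A 98% accuracy here hides the fact that the model never retrieves a single relevant pair, which is exactly why recall and precision are the metrics to watch for retrieval.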
Self-check question

Your sentence transformer model has a Recall@10 of 0.98 but Precision@10 of 0.12 on a search task. Is it good for production? Why or why not?

Answer: This means the model finds almost all relevant sentences (high recall) but also returns many irrelevant ones (low precision). It may overwhelm users with poor results. Depending on the use case, you might want to improve precision before production.

Key Result
Recall@K and Precision@K are key metrics to evaluate how well sentence transformers find relevant sentences.