When evaluating embeddings that capture semantic meaning, metrics like cosine similarity and Euclidean distance matter most. These metrics measure how close or similar two word or sentence vectors are in space. A smaller distance or higher cosine similarity means the embeddings represent similar meanings. This helps us check if the model understands relationships between words or sentences.
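The two metrics can be sketched in a few lines. This is a minimal illustration using toy 3-dimensional vectors with made-up values, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1 = same direction, near 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two points in embedding space."""
    return float(np.linalg.norm(a - b))

# Toy 3-d vectors standing in for real word embeddings (hypothetical values).
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))   # high: related meanings
print(cosine_similarity(cat, car))   # low: unrelated meanings
print(euclidean_distance(cat, dog))  # small: related words sit close together
```

Real embeddings have hundreds of dimensions, but the same two functions apply unchanged.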
Metrics & Evaluation - Why embeddings capture semantic meaning in NLP
Which metric matters for this concept and WHY
Confusion matrix or equivalent visualization (ASCII)
Embedding similarity matrix example (cosine similarity):
        cat    dog    apple  car
cat     1.00   0.85   0.10   0.20
dog     0.85   1.00   0.05   0.15
apple   0.10   0.05   1.00   0.30
car     0.20   0.15   0.30   1.00
High values (close to 1) between 'cat' and 'dog' show semantic closeness.
Low values between 'cat' and 'apple' show semantic difference.
Precision vs Recall (or equivalent tradeoff) with concrete examples
For embeddings, the tradeoff is between semantic precision and semantic recall.
- Semantic Precision: How often the closest embeddings truly mean the same or similar things. High precision means few false matches.
- Semantic Recall: How many true semantic matches the embeddings find among all possible matches. High recall means few misses.
Example: In a search engine, high semantic precision means the top results are very relevant. High semantic recall means the engine finds most relevant results, even if some are less precise.
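The search-engine example above can be made concrete with set arithmetic. The query results below are hypothetical, chosen only to show the two formulas:

```python
# Toy search result for one query (hypothetical data).
retrieved = {"car", "automobile", "banana"}   # what the engine returned
relevant  = {"car", "automobile", "vehicle"}  # what truly matches the query

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)  # fraction of results that are relevant
recall    = len(true_positives) / len(relevant)   # fraction of relevant items found

print(precision)  # 2/3: one returned result ("banana") was a false match
print(recall)     # 2/3: one relevant item ("vehicle") was missed
```

Returning more results typically raises recall but lowers precision, which is the tradeoff described above.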
What "good" vs "bad" metric values look like for this use case
Good embedding metrics:
- Cosine similarity close to 1 for synonyms or related words (e.g., "car" and "automobile" > 0.8)
- Cosine similarity close to 0 or negative for unrelated words (e.g., "car" and "banana" < 0.2)
- Consistent distances that reflect known semantic relationships
Bad embedding metrics:
- High similarity between unrelated words (false positives)
- Low similarity between synonyms or related words (false negatives)
- Random or noisy similarity scores that do not reflect meaning
Metrics pitfalls (accuracy paradox, data leakage, overfitting indicators)
- Accuracy paradox: Using simple accuracy on classification of embeddings can be misleading because semantic similarity is continuous, not binary.
- Data leakage: If embeddings are trained on test data, similarity scores will be unrealistically high.
- Overfitting: Embeddings that memorize training pairs may show perfect similarity on training but fail on new words.
- Ignoring context: Static embeddings may fail to capture meaning changes in different sentences.
Your model has 98% accuracy but 12% recall on fraud. Is it good?
This question is about fraud detection, not embeddings, but it teaches an important lesson.
Even with 98% accuracy, 12% recall means the model misses 88% of fraud cases. This is bad because catching fraud is critical. High recall is more important here.
Similarly, for embeddings, a metric must match the goal. High similarity scores alone don't guarantee good semantic understanding if many true matches are missed.
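The arithmetic behind the fraud question is worth spelling out. The counts below are hypothetical (10,000 cases, 2% fraud, no false alarms) but chosen to reproduce roughly the numbers in the question:

```python
# Hypothetical dataset: 10,000 cases, 2% of them fraudulent.
total, fraud = 10_000, 200
caught = 24                      # 12% recall: 24 of 200 fraud cases flagged
missed = fraud - caught          # 176 fraud cases slip through

# Assume the model raises no false alarms, so every error is a missed fraud case.
accuracy = (total - missed) / total
recall = caught / fraud

print(f"accuracy = {accuracy:.1%}")  # looks great
print(f"recall   = {recall:.0%}")    # misses 88% of fraud
```

High accuracy here is mostly a reflection of class imbalance: predicting "not fraud" for nearly everything is almost always right, yet useless for the actual goal.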
Key Result
Cosine similarity is the key metric for measuring how well embeddings capture semantic meaning: related words should score close to 1 and unrelated words close to 0, mirroring their real-world relationships.