When we talk about embeddings capturing semantic meaning, the key metric is cosine similarity. This metric measures how close two vectors are in direction, regardless of their length. Since embeddings are vectors representing words or sentences, cosine similarity tells us how similar their meanings are. A higher cosine similarity means the embeddings share more semantic meaning.
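To make the definition concrete, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the three toy vectors are assumptions made up for illustration; real embeddings usually have hundreds of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration only.
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # v scaled by 2: same direction, longer
orthogonal = [3.0, 0.0, -1.0]     # dot product with v is zero

print(cosine_similarity(v, same_direction))  # approximately 1: length is ignored
print(cosine_similarity(v, orthogonal))      # 0: no shared direction
```

Scaling a vector does not change the score, which is what "regardless of their length" means in practice.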
Why embeddings capture semantic meaning in Prompt Engineering / GenAI - Why Metrics Matter
Example: Comparing embeddings of the words "cat", "dog", and "car" using cosine similarity
       cat    dog    car
cat    1.00   0.85   0.10
dog    0.85   1.00   0.12
car    0.10   0.12   1.00
Here, "cat" and "dog" have high similarity (0.85), showing semantic closeness.
"cat" and "car" have low similarity (0.10), showing different meanings.
In semantic search or recommendation systems using embeddings, precision means how many of the retrieved items are truly relevant (semantically close). Recall means how many of all relevant items were found.
For example, if you search for "apple" meaning the fruit, high precision means most results are about fruit, not the company. High recall means you find most fruit-related items.
Sometimes increasing recall (finding more related items) lowers precision (some unrelated items appear). Balancing these depends on the application.
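The trade-off can be sketched with a small calculation. The helper `precision_recall` and the document IDs below are hypothetical, chosen only to mirror the "apple" example: a narrow result set and a broad one are scored against the same set of relevant items.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved item IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs for a query about "apple" the fruit.
relevant = {"fruit_1", "fruit_2", "fruit_3", "fruit_4"}
narrow = ["fruit_1", "fruit_2"]
broad = ["fruit_1", "fruit_2", "fruit_3",
         "company_1", "company_2", "company_3"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 0.75)
```

Retrieving more items raised recall from 0.5 to 0.75 but dropped precision from 1.0 to 0.5, because the extra results included items about the company.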
A good embedding model will have:
- High cosine similarity (close to 1) for semantically similar words or sentences.
- Low cosine similarity (close to 0 or negative) for unrelated meanings.
A bad model might show high similarity for unrelated words, confusing meanings, or low similarity for synonyms.
- Vector length effects: Euclidean distance is sensitive to vector magnitude, so using it in place of cosine similarity on unnormalized embeddings can give a misleading picture of semantic closeness.
- Overfitting embeddings: Embeddings trained on small data may memorize instead of generalizing meaning.
- Data leakage: If test words appear in training, similarity scores may be artificially high.
- Ignoring context: Static embeddings assign one vector per word regardless of the sentence it appears in, so they miss context-dependent meanings and capture less of the real semantics.
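The first pitfall in the list above can be illustrated with a short sketch. The vectors `a`, `b`, and `c` are invented for illustration: `b` points in the same direction as `a` but is ten times longer, while `c` is close to `a` in space but points in a different direction.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, not taken from any real model.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]  # same direction as a, much longer
c = [3.0, 2.0, 1.0]     # similar length to a, different direction

# Euclidean distance ranks c as the nearer neighbour of a...
print(math.dist(a, b), math.dist(a, c))
# ...while cosine similarity ranks b as the more similar one.
print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Under these assumptions the two measures disagree about which vector is nearest to `a`, so the choice of metric should match how the embeddings were trained and whether they are normalized.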
Your embedding model shows cosine similarity of 0.95 between "bank" (financial) and "river". Is this good? Why or why not?
Answer: No, this is not good. In its financial sense, "bank" is unrelated to "river". A similarity of 0.95 suggests the model is conflating the financial sense with the riverbank sense, so it does not capture the semantic difference well.
Practice
Solution
Step 1: Understand what embeddings do
Embeddings convert words or ideas into numbers that capture their meaning.
Step 2: Recognize why this helps computers
Numbers allow computers to compare and find similarities between words easily.
Final Answer:
Because they turn words into numbers that show meaning -> Option B
Quick Check:
Embeddings = numbers showing meaning [OK]
- Thinking embeddings store images
- Confusing embeddings with translation
- Believing embeddings count letters
Solution
Step 1: Identify the correct technical description
Embeddings represent words as vectors (lists) of numbers.
Step 2: Eliminate incorrect options
Raw text, pictures, and frequency counts do not capture semantic meaning as embeddings do.
Final Answer:
Embeddings map words to vectors of numbers -> Option D
Quick Check:
Embeddings = vectors of numbers [OK]
- Confusing embeddings with raw text storage
- Thinking embeddings are images
- Mixing embeddings with word counts
embedding1 = [0.1, 0.3, 0.5] and embedding2 = [0.1, 0.31, 0.49], what can we say about their semantic similarity?
Solution
Step 1: Compare the two embeddings numerically
The numbers are close but not identical, showing some similarity.
Step 2: Understand what closeness means in embeddings
Close embeddings mean similar meanings, but not exactly the same.
Final Answer:
They are somewhat similar in meaning -> Option C
Quick Check:
Close vectors = similar meaning [OK]
- Assuming small differences mean no similarity
- Thinking embeddings must be identical to be similar
- Ignoring numerical closeness
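As a rough check on "close vectors = similar meaning", the two vectors from this question can be scored with cosine similarity. This is only a sketch: the function name is an assumption, and three-dimensional toy vectors say nothing about what a real model would output.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

embedding1 = [0.1, 0.3, 0.5]
embedding2 = [0.1, 0.31, 0.49]

similarity = cosine_similarity(embedding1, embedding2)
print(round(similarity, 4))  # above 0.999, just below 1
```

The small numerical differences barely change the direction of the vector, so the score stays very close to 1 without reaching it.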
embedding1 = [0.2, 0.4, 0.6]
embedding2 = [0.2, 0.4, 0.6]
similarity = sum(embedding1[i] * embedding2[i] for i in range(3))
print(similarity)
What is the error in this code?
Solution
Step 1: Analyze the code logic
The code calculates the dot product by summing element-wise products.
Step 2: Check if this is a valid similarity measure
Dot product is a common way to measure similarity between embeddings.
Final Answer:
The code correctly computes dot product similarity -> Option A
Quick Check:
Dot product code is correct [OK]
- Thinking sum can't be used with generator expressions
- Believing normalization is always required
- Confusing indices usage
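The dot product in the question is valid as written, but its value depends on vector length. The sketch below, using hypothetical helpers `dot` and `normalize`, shows how the raw dot product relates to the normalized version, which equals cosine similarity.

```python
import math

def dot(a, b):
    """Sum of element-wise products, written with zip instead of indices."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

embedding1 = [0.2, 0.4, 0.6]
embedding2 = [0.2, 0.4, 0.6]

raw = dot(embedding1, embedding2)                         # about 0.56
unit = dot(normalize(embedding1), normalize(embedding2))  # about 1.0
scaled = dot([2 * x for x in embedding1], embedding2)     # about 1.12

print(raw, unit, scaled)
```

Identical vectors give a raw dot product of about 0.56 here, not 1, and doubling one vector doubles the score. Normalization is therefore optional rather than required, but it matters whenever scores need to be comparable across vectors of different lengths.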
'cat', 'dog', and 'car'. Which embedding pair is expected to be closest in meaning and why?
Solution
Step 1: Understand semantic meaning in embeddings
Embeddings capture meaning, so similar concepts have closer embeddings.
Step 2: Compare the word pairs by meaning
'Cat' and 'dog' are both animals, so their embeddings should be closer than unrelated words.
Final Answer:
Embeddings of 'cat' and 'dog' because both are animals -> Option A
Quick Check:
Similar meaning = closer embeddings [OK]
- Choosing words based on spelling or sound
- Ignoring actual meaning of words
- Assuming letter count affects embeddings
