Vector similarity metrics in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Vector similarity metrics
Which metric matters for Vector similarity metrics and WHY

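Vector similarity metrics measure how alike two vectors are. They help find items that are close or related in meaning or features. Common metrics include cosine similarity, Euclidean distance, and Manhattan distance. For cosine similarity, a higher value means more alike; for the two distances, a lower value means more alike.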

Cosine similarity is popular because it measures the angle between vectors, ignoring their length. This is useful when direction matters more than size, like comparing text meanings.

Euclidean distance measures straight-line distance between points, useful when absolute difference matters.

Choosing the right metric depends on your data and what "similar" means in your task.
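As a minimal sketch of how the three metrics differ, the snippet below compares a vector with a scaled copy of itself. The vectors `a` and `b` are made-up values chosen for illustration, not data from this lesson.

```python
import numpy as np

# Illustrative vectors (assumed for this sketch): b points the same way as a
# but is twice as long.
a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a

# Cosine similarity: dot product divided by the product of the lengths.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance between the two points.
euclidean = np.linalg.norm(a - b)

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))

print(f"cosine={cosine:.3f} euclidean={euclidean:.3f} manhattan={manhattan:.3f}")
```

Cosine similarity treats the two vectors as identical in direction (about 1.0), while both distances report a non-zero gap because the lengths differ.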

Confusion matrix or equivalent visualization

Vector similarity does not use a confusion matrix like classification. Instead, we look at similarity scores between pairs.

    Example: Comparing query vector Q with database vectors A, B, C

    Vector pairs and Cosine similarity scores:
    Q & A: 0.95 (very similar)
    Q & B: 0.60 (somewhat similar)
    Q & C: 0.10 (not similar)

    Higher scores mean more similarity. Cosine similarity ranges from -1.0 to 1.0.
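The comparison above could be reproduced roughly as follows. This is only a sketch: the 2-D vectors are constructed by a hypothetical helper, `unit_vector_with_cosine`, so that their scores against Q match the example; real embeddings would come from a model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def unit_vector_with_cosine(c):
    """Hypothetical helper: a 2-D unit vector whose cosine with [1, 0] is c."""
    return np.array([c, np.sqrt(1.0 - c ** 2)])

query = np.array([1.0, 0.0])
database = {
    "A": unit_vector_with_cosine(0.95),
    "B": unit_vector_with_cosine(0.60),
    "C": unit_vector_with_cosine(0.10),
}

# Score every database vector against the query, then sort best-first.
scores = {name: cosine_similarity(query, vec) for name, vec in database.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"Q & {name}: {score:.2f}")
```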
    
Precision vs Recall tradeoff with concrete examples

When using vector similarity for search or recommendations, you pick a similarity threshold to decide what counts as "similar enough."

High threshold (e.g., 0.9): Only very close matches are returned. This means high precision (few wrong matches) but low recall (may miss some relevant items).

Low threshold (e.g., 0.5): More items are returned, including less similar ones. This means high recall (finds most relevant items) but lower precision (more irrelevant items included).

Example: In a movie recommendation system, a high threshold shows only very similar movies (precise but fewer), while a low threshold shows many movies including less related ones.
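To make the tradeoff concrete, here is a small sketch that computes precision and recall at the two thresholds mentioned above. The scores and relevance labels are invented for illustration, and `precision_recall` is a hypothetical helper rather than a library function.

```python
import numpy as np

# Hypothetical similarity scores for six candidates, and whether each one is
# actually relevant (1) or not (0). Illustrative values only.
scores = np.array([0.95, 0.92, 0.85, 0.70, 0.60, 0.55])
relevant = np.array([1, 1, 0, 1, 0, 1])

def precision_recall(scores, relevant, threshold):
    """Precision and recall when returning every item scoring >= threshold."""
    returned = scores >= threshold
    true_positives = np.sum(returned & (relevant == 1))
    precision = true_positives / max(np.sum(returned), 1)
    recall = true_positives / max(np.sum(relevant == 1), 1)
    return float(precision), float(recall)

high = precision_recall(scores, relevant, 0.9)  # strict cutoff
low = precision_recall(scores, relevant, 0.5)   # lenient cutoff

print(f"threshold 0.9: precision={high[0]:.2f} recall={high[1]:.2f}")
print(f"threshold 0.5: precision={low[0]:.2f} recall={low[1]:.2f}")
```

With this toy data the strict cutoff returns only relevant items but misses half of them, while the lenient cutoff finds every relevant item at the cost of some irrelevant ones.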

What "good" vs "bad" metric values look like for Vector similarity

Good: Similar items have high similarity scores (close to 1 for cosine), and dissimilar items have low scores (close to 0 or negative for cosine). Clear separation helps make confident decisions.

Bad: Scores cluster around the middle (e.g., 0.5) for all pairs, making it hard to tell similar from dissimilar. This means the metric or vector representation is not capturing meaningful differences.
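One rough way to check for clear separation is to compare the lowest score among known-similar pairs with the highest score among known-dissimilar pairs. The sketch below assumes you have a few hand-labeled pairs; all scores are invented, and `separation_gap` is a hypothetical helper.

```python
import numpy as np

# Hypothetical cosine scores for hand-labeled pairs (illustrative only).
good_similar = np.array([0.91, 0.88, 0.95])
good_dissimilar = np.array([0.05, 0.12, -0.03])
bad_similar = np.array([0.55, 0.52, 0.49])
bad_dissimilar = np.array([0.50, 0.47, 0.53])

def separation_gap(similar, dissimilar):
    """Lowest similar-pair score minus highest dissimilar-pair score."""
    return float(similar.min() - dissimilar.max())

good_gap = separation_gap(good_similar, good_dissimilar)
bad_gap = separation_gap(bad_similar, bad_dissimilar)

print(f"good gap: {good_gap:.2f}, bad gap: {bad_gap:.2f}")
```

A positive gap means some threshold separates the two groups cleanly; a gap at or below zero means the groups overlap and no single threshold can split them.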

Metrics pitfalls
  • Ignoring vector normalization: Cosine similarity is unaffected by vector length, so it does not need normalized inputs. The dot product and Euclidean distance are affected, so if your index ranks by either of them, normalize vectors to unit length first. On unit vectors, the dot product equals cosine similarity.
  • Using Euclidean distance on high-dimensional data: In high dimensions, distances between points tend to become nearly equal (the "curse of dimensionality"), so the nearest and farthest neighbors are hard to tell apart.
  • Choosing wrong metric for data type: For example, cosine similarity is better for text embeddings, but Euclidean might be better for physical coordinates.
  • Threshold selection without validation: Picking similarity cutoffs without testing can lead to poor precision or recall.
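The relationship between normalization and the different metrics can be checked numerically. In this sketch the vectors are arbitrary illustrative values; it shows that once vectors are scaled to unit length, the plain dot product matches cosine similarity, and squared Euclidean distance equals 2 - 2 * cosine.

```python
import numpy as np

# Arbitrary example vectors (assumed for illustration).
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Scale each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot_on_units = np.dot(a_unit, b_unit)
squared_distance_on_units = np.sum((a_unit - b_unit) ** 2)

print(np.isclose(dot_on_units, cosine))                       # True
print(np.isclose(squared_distance_on_units, 2 - 2 * cosine))  # True
```

This is why ranking unit vectors by dot product, by cosine similarity, or by Euclidean distance gives the same order.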
Self-check question

Your search system uses cosine similarity with a threshold of 0.8. You find many relevant results but also many irrelevant ones. What should you do?

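Answer: Raise the threshold above 0.8. Many irrelevant results means precision is too low, and a higher threshold returns fewer but more precise matches. Lowering it would do the opposite: recall would go up and precision would drop further. Expect to lose some relevant results as the threshold rises, so check the new cutoff against labeled examples.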

Key Result
Vector similarity metrics like cosine similarity measure how close vectors are; choosing the right metric and threshold balances precision and recall in similarity tasks.

Practice

1. Which vector similarity metric measures the angle between two vectors to determine how similar they are?
easy
A. Manhattan distance
B. Euclidean distance
C. Cosine similarity
D. Jaccard similarity

Solution

  1. Step 1: Understand cosine similarity

    Cosine similarity measures the cosine of the angle between two vectors, showing how aligned they are.
  2. Step 2: Compare with other metrics

    Euclidean and Manhattan distances measure gaps, not angles. Jaccard is for sets, not vectors.
  3. Final Answer:

    Cosine similarity -> Option C
  4. Quick Check:

    Angle-based similarity = Cosine similarity [OK]
Hint: Angle means cosine similarity, distance means Euclidean [OK]
Common Mistakes:
  • Confusing distance with angle measurement
  • Thinking Euclidean measures angle
  • Mixing set similarity with vector similarity
2. Which of the following is the correct Python expression to compute cosine similarity between two vectors a and b using numpy?
easy
A. np.linalg.norm(a - b)
B. np.dot(a, b) * (np.linalg.norm(a) + np.linalg.norm(b))
C. np.sum(np.abs(a - b))
D. np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Solution

  1. Step 1: Recall cosine similarity formula

    Cosine similarity = dot product of vectors divided by product of their lengths (norms).
  2. Step 2: Match formula to code

    np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) matches this formula exactly. Option A, np.linalg.norm(a - b), is Euclidean distance; option C is Manhattan distance; option B multiplies by the sum of the norms, which is not a valid formula.
  3. Final Answer:

    np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) -> Option D
  4. Quick Check:

    Dot product over norms = cosine similarity [OK]
Hint: Cosine = dot product divided by norms product [OK]
Common Mistakes:
  • Using subtraction instead of dot product
  • Multiplying by the norms instead of dividing by their product
  • Confusing Euclidean with cosine formula
3. Given vectors a = np.array([1, 2, 3]) and b = np.array([4, 5, 6]), what is the output of np.linalg.norm(a - b)?
medium
A. 3.742
B. 5.196
C. 15.0
D. 32.0

Solution

  1. Step 1: Calculate vector difference

    a - b = [1-4, 2-5, 3-6] = [-3, -3, -3]
  2. Step 2: Compute Euclidean norm

    Norm = sqrt((-3)^2 + (-3)^2 + (-3)^2) = sqrt(9+9+9) = sqrt(27) ≈ 5.196
  3. Final Answer:

    5.196 -> Option B
  4. Quick Check:

    Euclidean distance = 5.196 [OK]
Hint: Euclidean norm = sqrt(sum of squared differences) [OK]
Common Mistakes:
  • Forgetting to square differences
  • Calculating sum instead of sqrt of sum
  • Mixing up vector subtraction order
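The arithmetic in this solution can be confirmed directly with NumPy. The extra lines are only an observation: two of the wrong options happen to equal other quantities computed from the same vectors.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Euclidean distance between a and b: sqrt(27).
distance = np.linalg.norm(a - b)
print(round(float(distance), 3))  # 5.196

# Two distractors match other quantities, which is an easy way to slip:
norm_a = np.linalg.norm(a)  # length of a alone, about 3.742
dot_ab = np.dot(a, b)       # dot product, 32
print(round(float(norm_a), 3), int(dot_ab))
```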
4. Identify the error in this Python code snippet for cosine similarity:
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / np.linalg.norm(a) + np.linalg.norm(b)

print(cosine_sim(np.array([1, 0]), np.array([0, 1])))
medium
A. The denominator should multiply norms, not add them
B. np.dot is used incorrectly; should be np.cross
C. Vectors must be normalized before dot product
D. Function is missing return statement

Solution

  1. Step 1: Analyze denominator in formula

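    The code adds the norms, np.linalg.norm(a) + np.linalg.norm(b), but cosine similarity divides by their product. Without parentheses, Python also evaluates the division first, so the function returns (np.dot(a, b) / np.linalg.norm(a)) + np.linalg.norm(b), which is 1.0 for these perpendicular vectors instead of 0.0.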
  2. Step 2: Understand correct formula

    Cosine similarity = dot(a,b) / (norm(a) * norm(b)), so addition is wrong here.
  3. Final Answer:

    The denominator should multiply norms, not add them -> Option A
  4. Quick Check:

    Denominator = product of norms [OK]
Hint: Denominator in cosine similarity multiplies norms [OK]
Common Mistakes:
  • Adding norms instead of multiplying
  • Using cross product instead of dot product
  • Forgetting to return value
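For reference, one possible corrected version of the function from question 4 is sketched below. It keeps the same name and inputs and only changes the denominator.

```python
import numpy as np

def cosine_sim(a, b):
    # Parenthesize the product of the norms so it forms the whole denominator.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Perpendicular vectors share no direction, so the similarity is zero.
result = cosine_sim(np.array([1, 0]), np.array([0, 1]))
print(result)  # 0.0
```

Note that this version still divides by zero if either input is the zero vector; handling that case is left out of the sketch.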
5. You have two text documents represented as vectors: doc1 = [1, 0, 2, 1] and doc2 = [0, 1, 1, 1]. Which similarity metric is best to find how similar their topics are, and why?
hard
A. Cosine similarity, because it measures angle ignoring length differences
B. Euclidean distance, because it measures exact gap between vectors
C. Manhattan distance, because it sums absolute differences
D. Jaccard similarity, because it compares set overlap

Solution

  1. Step 1: Understand vector meaning in text

    Vectors represent word counts or weights; length can vary by document size.
  2. Step 2: Choose metric ignoring length but capturing direction

    Cosine similarity measures angle, so it focuses on topic similarity ignoring document length differences.
  3. Final Answer:

    Cosine similarity, because it measures angle ignoring length differences -> Option A
  4. Quick Check:

    Topic similarity = cosine similarity [OK]
Hint: For text, angle-based similarity works best [OK]
Common Mistakes:
  • Using Euclidean which is sensitive to length
  • Confusing set similarity with vector similarity
  • Ignoring document length effect
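The effect of document length can be illustrated with the two vectors from question 5. The scaled vector `doc1_long` is an assumption added for this sketch: it stands in for a document ten times longer than doc1 with the same word proportions.

```python
import numpy as np

doc1 = np.array([1, 0, 2, 1])
doc2 = np.array([0, 1, 1, 1])

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical longer document: same word proportions, ten times the counts.
doc1_long = 10 * doc1

cos_original = cosine_similarity(doc1, doc2)
cos_long = cosine_similarity(doc1_long, doc2)
dist_original = float(np.linalg.norm(doc1 - doc2))
dist_long = float(np.linalg.norm(doc1_long - doc2))

print(f"cosine:    {cos_original:.3f} -> {cos_long:.3f}")
print(f"euclidean: {dist_original:.3f} -> {dist_long:.3f}")
```

The cosine score stays the same when the document grows, while the Euclidean distance increases sharply, which is why option A fits topic comparison.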