Vector similarity metrics in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Vector similarity metrics
Problem: You want to measure how similar two vectors are using different similarity metrics. Currently you use Euclidean distance, but it doesn't always reflect similarity well for your data.
Current Metrics: Euclidean distance between sample vectors: 5.0 (example value)
Issue: Euclidean distance can be misleading when vectors differ in magnitude: two vectors that point in the same direction can still be far apart. You want to try other similarity metrics that better capture the angle or overlap between vectors.
Your Task
Implement and compare cosine similarity and Jaccard similarity with Euclidean distance on example vectors. Show which metric better reflects similarity for given pairs.
Use only numpy and standard Python libraries.
Vectors are numeric and can be floats.
Jaccard similarity should be applied to binary vectors only.
Solution
import numpy as np

def euclidean_distance(vec1, vec2):
    return np.linalg.norm(vec1 - vec2)

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)

def jaccard_similarity(vec1, vec2):
    # Convert numeric vectors to binary by thresholding at 0.5
    bin_vec1 = vec1 > 0.5
    bin_vec2 = vec2 > 0.5
    intersection = np.logical_and(bin_vec1, bin_vec2).sum()
    union = np.logical_or(bin_vec1, bin_vec2).sum()
    if union == 0:
        return 0.0
    return intersection / union

# Example vectors
vector_a = np.array([1.0, 2.0, 3.0, 4.0])
vector_b = np.array([2.0, 3.0, 4.0, 5.0])
vector_c = np.array([0.0, 0.0, 0.0, 0.0])

print(f"Euclidean distance between A and B: {euclidean_distance(vector_a, vector_b):.3f}")
print(f"Cosine similarity between A and B: {cosine_similarity(vector_a, vector_b):.3f}")
print(f"Jaccard similarity between A and B: {jaccard_similarity(vector_a, vector_b):.3f}")

print(f"Euclidean distance between A and C: {euclidean_distance(vector_a, vector_c):.3f}")
print(f"Cosine similarity between A and C: {cosine_similarity(vector_a, vector_c):.3f}")
print(f"Jaccard similarity between A and C: {jaccard_similarity(vector_a, vector_c):.3f}")
Added cosine similarity function to measure angle-based similarity.
Added Jaccard similarity function for binary vector overlap.
Provided example vectors and printed all three similarity metrics for comparison.
Results Interpretation

Before, only Euclidean distance was used, which gave a value of 5.0 (example). After adding cosine and Jaccard similarity, we see:

  • Euclidean distance between A and B: 2.0 (smaller means closer)
  • Cosine similarity between A and B: 0.994 (close to 1 means very similar direction)
  • Jaccard similarity between A and B: 1.0 (full overlap in binary thresholded vectors)

For vectors A and C (the zero vector), cosine and Jaccard similarity are both 0, while Euclidean distance is larger (5.477). Strictly, cosine similarity is undefined for a zero vector; the code returns 0.0 by convention.

Different similarity metrics capture different aspects of vector similarity. Cosine similarity is good for direction similarity, Jaccard for binary overlap, and Euclidean for absolute distance. Choosing the right metric depends on your data and what similarity means in your context.
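
As a rough illustration of how the metrics can disagree, the sketch below compares one query vector against two candidates. The vectors `query`, `same_direction`, and `nearby` are hypothetical values chosen for this example and are not part of the original experiment.

```python
import numpy as np

def euclidean_distance(vec1, vec2):
    return np.linalg.norm(vec1 - vec2)

def cosine_similarity(vec1, vec2):
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    if norm1 == 0 or norm2 == 0:
        return 0.0  # same zero-vector convention as the solution code
    return np.dot(vec1, vec2) / (norm1 * norm2)

# Hypothetical vectors chosen for illustration
query = np.array([1.0, 1.0])
same_direction = np.array([5.0, 5.0])  # parallel to query, but much longer
nearby = np.array([1.0, 0.0])          # close to query, but 45 degrees away

dist_same = euclidean_distance(query, same_direction)
dist_near = euclidean_distance(query, nearby)
cos_same = cosine_similarity(query, same_direction)
cos_near = cosine_similarity(query, nearby)

print(f"same_direction: distance={dist_same:.3f}, cosine={cos_same:.3f}")
print(f"nearby:         distance={dist_near:.3f}, cosine={cos_near:.3f}")
```

Euclidean distance ranks `nearby` as the closer candidate, while cosine similarity ranks `same_direction` as the more similar one, so the choice of metric changes which neighbor you would retrieve.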
Bonus Experiment
Try implementing Manhattan distance and compare it with Euclidean distance on the same vectors.
💡 Hint
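Manhattan distance sums the absolute differences of the vector components. Because the differences are not squared, a single large component difference influences it less than it influences Euclidean distance.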
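
One possible starting point for the bonus experiment is sketched below. The function name `manhattan_distance` is an assumption made for this example; the vectors are the ones from the solution code.

```python
import numpy as np

def manhattan_distance(vec1, vec2):
    # Sum of absolute component-wise differences (L1 norm of the difference)
    return np.sum(np.abs(vec1 - vec2))

def euclidean_distance(vec1, vec2):
    return np.linalg.norm(vec1 - vec2)

vector_a = np.array([1.0, 2.0, 3.0, 4.0])
vector_b = np.array([2.0, 3.0, 4.0, 5.0])
vector_c = np.array([0.0, 0.0, 0.0, 0.0])

man_ab = manhattan_distance(vector_a, vector_b)
man_ac = manhattan_distance(vector_a, vector_c)
euc_ab = euclidean_distance(vector_a, vector_b)
euc_ac = euclidean_distance(vector_a, vector_c)

print(f"A vs B: Manhattan={man_ab:.3f}, Euclidean={euc_ab:.3f}")
print(f"A vs C: Manhattan={man_ac:.3f}, Euclidean={euc_ac:.3f}")
```

Manhattan distance is never smaller than Euclidean distance for the same pair of vectors, so the two should be compared by how they rank pairs, not by their raw values.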

Practice

1. Which vector similarity metric measures the angle between two vectors to determine how similar they are?
easy
A. Manhattan distance
B. Euclidean distance
C. Cosine similarity
D. Jaccard similarity

Solution

  1. Step 1: Understand cosine similarity

    Cosine similarity measures the cosine of the angle between two vectors, showing how aligned they are.
  2. Step 2: Compare with other metrics

    Euclidean and Manhattan distances measure gaps, not angles. Jaccard measures overlap between sets (or binary vectors), not angles.
  3. Final Answer:

    Cosine similarity -> Option C
  4. Quick Check:

    Angle-based similarity = Cosine similarity [OK]
Hint: Angle means cosine similarity, distance means Euclidean [OK]
Common Mistakes:
  • Confusing distance with angle measurement
  • Thinking Euclidean measures angle
  • Mixing set similarity with vector similarity
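
To see the link between cosine similarity and the angle itself, the following sketch recovers the angle in degrees from the cosine value. The helper `angle_degrees` is a hypothetical name introduced for this example, and it assumes both inputs are non-zero vectors.

```python
import numpy as np

def angle_degrees(vec1, vec2):
    # Assumes neither vector is the zero vector
    cos = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    # Clip to guard against rounding that lands slightly outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

same_direction = angle_degrees(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthogonal = angle_degrees(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = angle_degrees(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))

print(same_direction, orthogonal, opposite)
```

Cosine values of 1, 0, and -1 correspond to angles of 0, 90, and 180 degrees, regardless of how long the vectors are.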
2. Which of the following is the correct Python expression to compute cosine similarity between two vectors a and b using numpy?
easy
A. np.linalg.norm(a - b)
B. np.dot(a, b) * (np.linalg.norm(a) + np.linalg.norm(b))
C. np.sum(np.abs(a - b))
D. np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Solution

  1. Step 1: Recall cosine similarity formula

    Cosine similarity = dot product of vectors divided by product of their lengths (norms).
  2. Step 2: Match formula to code

    np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) matches this formula exactly. Option A, np.linalg.norm(a - b), is Euclidean distance; option C is Manhattan distance; option B is an incorrect formula.
  3. Final Answer:

    np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) -> Option D
  4. Quick Check:

    Dot product over norms = cosine similarity [OK]
Hint: Cosine = dot product divided by norms product [OK]
Common Mistakes:
  • Using subtraction instead of dot product
  • Multiplying by the norms instead of dividing by their product
  • Confusing Euclidean with cosine formula
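
A quick way to check the four options is to evaluate each expression on a small pair of vectors. The vectors below are hypothetical and chosen so that the true angle is 45 degrees.

```python
import numpy as np

# Hypothetical vectors, 45 degrees apart
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

option_a = np.linalg.norm(a - b)                                      # Euclidean distance
option_b = np.dot(a, b) * (np.linalg.norm(a) + np.linalg.norm(b))     # incorrect formula
option_c = np.sum(np.abs(a - b))                                      # Manhattan distance
option_d = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))     # cosine similarity

print(f"A={option_a:.3f}, B={option_b:.3f}, C={option_c:.3f}, D={option_d:.3f}")
```

Only option D yields cos(45°) ≈ 0.707. Option B produces a value above 1, which a cosine can never be.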
3. Given vectors a = np.array([1, 2, 3]) and b = np.array([4, 5, 6]), what is the output of np.linalg.norm(a - b)?
medium
A. 3.742
B. 5.196
C. 15.0
D. 32.0

Solution

  1. Step 1: Calculate vector difference

    a - b = [1-4, 2-5, 3-6] = [-3, -3, -3]
  2. Step 2: Compute Euclidean norm

    Norm = sqrt((-3)^2 + (-3)^2 + (-3)^2) = sqrt(9+9+9) = sqrt(27) ≈ 5.196
  3. Final Answer:

    5.196 -> Option B
  4. Quick Check:

    Euclidean distance = 5.196 [OK]
Hint: Euclidean norm = sqrt(sum of squared differences) [OK]
Common Mistakes:
  • Forgetting to square differences
  • Calculating sum instead of sqrt of sum
  • Mixing up vector subtraction order
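
The arithmetic in the steps above can be reproduced directly; this sketch computes the distance by hand and compares it with `np.linalg.norm`. The intermediate names `diff` and `squared_sum` are introduced only for this example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

diff = a - b                      # component-wise differences: [-3, -3, -3]
squared_sum = np.sum(diff ** 2)   # 9 + 9 + 9 = 27
distance = np.sqrt(squared_sum)   # sqrt(27)

print(f"{distance:.3f}")
print(np.isclose(distance, np.linalg.norm(a - b)))
print(np.isclose(distance, np.linalg.norm(b - a)))
```

The last check shows that swapping the operands leaves the distance unchanged, since each difference is squared before summing.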
4. Identify the error in this Python code snippet for cosine similarity:
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / np.linalg.norm(a) + np.linalg.norm(b)

print(cosine_sim(np.array([1, 0]), np.array([0, 1])))
medium
A. The denominator should multiply norms, not add them
B. np.dot is used incorrectly; should be np.cross
C. Vectors must be normalized before dot product
D. Function is missing return statement

Solution

  1. Step 1: Analyze denominator in formula

    Because / binds tighter than +, the code computes np.dot(a, b) / np.linalg.norm(a) and then adds np.linalg.norm(b). Cosine similarity instead divides the dot product by the product of the two norms.
  2. Step 2: Understand correct formula

    Cosine similarity = dot(a,b) / (norm(a) * norm(b)), so addition is wrong here.
  3. Final Answer:

    The denominator should multiply norms, not add them -> Option A
  4. Quick Check:

    Denominator = product of norms [OK]
Hint: Denominator in cosine similarity multiplies norms [OK]
Common Mistakes:
  • Adding norms instead of multiplying
  • Using cross product instead of dot product
  • Forgetting to return value
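
The effect of the bug is easiest to see by running the original expression next to a corrected one. The names `cosine_sim_buggy` and `cosine_sim_fixed` are assumptions made for this side-by-side sketch.

```python
import numpy as np

def cosine_sim_buggy(a, b):
    # Evaluates as (dot / norm(a)) + norm(b) because / binds tighter than +
    return np.dot(a, b) / np.linalg.norm(a) + np.linalg.norm(b)

def cosine_sim_fixed(a, b):
    # Parentheses make the denominator the product of both norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 0])
b = np.array([0, 1])

buggy = cosine_sim_buggy(a, b)
fixed = cosine_sim_fixed(a, b)
print(f"buggy={buggy:.1f}, fixed={fixed:.1f}")
```

For these orthogonal vectors the buggy version reports 1.0, suggesting identical direction, while the corrected version reports 0.0.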
5. You have two text documents represented as vectors: doc1 = [1, 0, 2, 1] and doc2 = [0, 1, 1, 1]. Which similarity metric is best to find how similar their topics are, and why?
hard
A. Cosine similarity, because it measures angle ignoring length differences
B. Euclidean distance, because it measures exact gap between vectors
C. Manhattan distance, because it sums absolute differences
D. Jaccard similarity, because it compares set overlap

Solution

  1. Step 1: Understand vector meaning in text

    Vectors represent word counts or weights; length can vary by document size.
  2. Step 2: Choose metric ignoring length but capturing direction

    Cosine similarity measures angle, so it focuses on topic similarity ignoring document length differences.
  3. Final Answer:

    Cosine similarity, because it measures angle ignoring length differences -> Option A
  4. Quick Check:

    Topic similarity = cosine similarity [OK]
Hint: For text, angle-based similarity works best [OK]
Common Mistakes:
  • Using Euclidean which is sensitive to length
  • Confusing set similarity with vector similarity
  • Ignoring document length effect
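
To illustrate why document length matters, the sketch below computes both metrics for doc1 and doc2, then repeats the comparison with doc1 doubled, as if the same text appeared twice. The doubled vector `doc1_doubled` is a hypothetical addition for this example.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # Assumes neither vector is the zero vector
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

doc1 = np.array([1, 0, 2, 1])
doc2 = np.array([0, 1, 1, 1])
doc1_doubled = 2 * doc1  # hypothetical: same content repeated twice

cos_original = cosine_similarity(doc1, doc2)
cos_doubled = cosine_similarity(doc1_doubled, doc2)
euc_original = np.linalg.norm(doc1 - doc2)
euc_doubled = np.linalg.norm(doc1_doubled - doc2)

print(f"cosine:    {cos_original:.3f} -> {cos_doubled:.3f}")
print(f"euclidean: {euc_original:.3f} -> {euc_doubled:.3f}")
```

Cosine similarity stays the same after doubling, while Euclidean distance grows, even though the topic mix of the document has not changed.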