Similarity search and retrieval in Prompt Engineering / GenAI - Full Explanation
Imagine you have a favorite song and want to find other songs that sound similar. Instead of knowing their names, you listen for similar beats, instruments, or moods. A music app does this by analyzing songs' features and quickly suggesting ones that feel alike.
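To make the analogy concrete, here is a minimal sketch that ranks two songs against a favorite by comparing feature vectors. The song names, the three features, and their values are hypothetical and exist only for illustration; a real music app would use far richer features.

```python
import numpy as np

# Hypothetical feature vectors: [tempo, acoustic-ness, energy], each scaled to 0..1.
songs = {
    "favorite": np.array([0.8, 0.1, 0.9]),
    "track_a": np.array([0.7, 0.2, 0.8]),
    "track_b": np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = songs["favorite"]

# Rank every other song by its similarity to the query, most similar first.
ranked = sorted(
    (name for name in songs if name != "favorite"),
    key=lambda name: cosine_similarity(query, songs[name]),
    reverse=True,
)
print(ranked)  # ['track_a', 'track_b']
```

track_a comes first because its features point in almost the same direction as the favorite's, while track_b is slow, acoustic, and low-energy.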
┌────────────────────────┐
│   Input Item (Query)   │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│ Feature Representation │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│ Similarity Measurement │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│    Indexed Database    │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│   Retrieval Results    │
└────────────────────────┘
Practice
What is the main goal of similarity search in machine learning?
Solution
Step 1: Understand the purpose of similarity search
Similarity search finds the items in a dataset that are most similar (closest) to a given query item.
Step 2: Compare options with the definition
Only "To find items that are close or alike in a collection" describes finding similar or close items, which matches the goal of similarity search.
Final Answer:
To find items that are close or alike in a collection -> Option C
Quick Check:
Similarity search = find similar items [OK]
- Confusing similarity search with sorting
- Thinking similarity search counts items
- Assuming it removes duplicates
Which of the following is the correct way to compute cosine similarity between two vectors A and B in Python using numpy?
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
# What code computes cosine similarity?
Solution
Step 1: Recall cosine similarity formula
Cosine similarity is the dot product of A and B divided by the product of their norms.
Step 2: Match formula to code options
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) matches the formula exactly: np.dot gives the dot product, and the parenthesized product of the two np.linalg.norm calls gives the denominator.
Final Answer:
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option D
Quick Check:
Cosine similarity = dot / (norm A * norm B) [OK]
- Adding norms instead of multiplying
- Subtracting norms in denominator
- Multiplying dot product by sum of norms
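The formula can be wrapped in a small reusable function and run on the A and B from the question. This is a sketch: the function name cosine_similarity and the choice to raise on a zero vector are assumptions, not part of the original exercise.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between 1-D vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumption: treat a zero vector as an error rather than returning NaN.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
print(round(cosine_similarity(A, B), 4))  # 0.9746
```

Here the dot product is 32 and the norms are sqrt(14) and sqrt(77), so the result is 32 / sqrt(1078), roughly 0.9746.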
Given the following vectors, what is the cosine similarity between vec1 and vec2?
import numpy as np
vec1 = np.array([1, 0, 0])
vec2 = np.array([0, 1, 0])
cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print("{:.2f}".format(cos_sim))
Solution
Step 1: Calculate dot product of vec1 and vec2
Dot product = 1*0 + 0*1 + 0*0 = 0.
Step 2: Calculate norms and cosine similarity
Norm of vec1 = 1, norm of vec2 = 1, so cosine similarity = 0 / (1*1) = 0.
Final Answer:
0.00 -> Option A
Quick Check:
Orthogonal vectors have cosine similarity 0 [OK]
- Confusing dot product with cosine similarity
- Forgetting to divide by norms
- Rounding errors causing wrong answer
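For reference, the orthogonal case sits between two other boundary cases worth remembering. The sketch below uses a helper named cos_sim (an illustrative name) and vectors chosen so the results are exact.

```python
import numpy as np

def cos_sim(a, b):
    """Dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 0.0, 0.0])

orthogonal = cos_sim(x, np.array([0.0, 1.0, 0.0]))   # perpendicular vectors
parallel = cos_sim(x, np.array([5.0, 0.0, 0.0]))     # same direction, different length
opposite = cos_sim(x, np.array([-2.0, 0.0, 0.0]))    # opposite direction

print(orthogonal, parallel, opposite)  # 0.0 1.0 -1.0
```

Because the norms are divided out, only direction matters: scaling a vector does not change its cosine similarity to another vector.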
Consider this code snippet for similarity search. What is the error?
import numpy as np
vectors = [np.array([1, 2]), np.array([3, 4])]
query = np.array([1, 0])
scores = []
for v in vectors:
    score = np.dot(query, v) / np.linalg.norm(query) * np.linalg.norm(v)
    scores.append(score)
print(scores)
Solution
Step 1: Analyze the cosine similarity formula in code
The formula should divide the dot product by the product of the norms: dot(query, v) / (norm(query) * norm(v)).
Step 2: Identify missing parentheses
The code evaluates np.dot(query, v) / np.linalg.norm(query) * np.linalg.norm(v) left to right, because / and * have equal precedence. It therefore divides by norm(query) and then multiplies by norm(v) instead of dividing by both, which gives a wrong result.
Final Answer:
Missing parentheses causing wrong order of operations -> Option A
Quick Check:
Use parentheses to group denominator multiplication [OK]
- Forgetting parentheses around denominator
- Using cross product instead of dot product
- Ignoring vector length mismatch
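Running both versions side by side on the vectors from the question shows the size of the error. The list names buggy and fixed are introduced here only for the comparison.

```python
import numpy as np

vectors = [np.array([1, 2]), np.array([3, 4])]
query = np.array([1, 0])

buggy, fixed = [], []
for v in vectors:
    dot = np.dot(query, v)
    # Without parentheses this is (dot / norm(query)) * norm(v).
    buggy.append(float(dot / np.linalg.norm(query) * np.linalg.norm(v)))
    # With parentheses this is dot / (norm(query) * norm(v)).
    fixed.append(float(dot / (np.linalg.norm(query) * np.linalg.norm(v))))

print([round(s, 3) for s in buggy])  # [2.236, 15.0]
print([round(s, 3) for s in fixed])  # [0.447, 0.6]
```

A quick sanity check catches this class of bug: cosine similarity always lies between -1 and 1, so scores such as 2.236 or 15.0 cannot be valid.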
You have a collection of text documents converted into vectors. You want to find the top 2 most similar documents to a new query vector using cosine similarity. Which approach is best?
- Compute cosine similarity between query and each document vector.
- Sort documents by similarity score descending.
- Return top 2 documents.
Which code snippet correctly implements this?
import numpy as np
docs = [np.array([1, 0]), np.array([0, 1]), np.array([1, 1])]
query = np.array([1, 0])
# Choose the correct code:
Solution
Step 1: Compute cosine similarity correctly
The correct snippet computes each score as the dot product divided by the product of the norms:
scores = [np.dot(query, d) / (np.linalg.norm(query) * np.linalg.norm(d)) for d in docs]
Step 2: Sort indices by similarity descending and select top 2
It then sorts the document indices by score in descending order and keeps the first two, matching the requirement:
top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(top2)
For the given docs the scores are 1.0, 0.0, and about 0.71, so the snippet prints [0, 2].
Final Answer:
The snippet made up of the scores and top2 statements shown in Steps 1 and 2, followed by print(top2) -> Option B
Quick Check:
Cosine similarity + sort descending + top 2 [OK]
- Multiplying by the norms instead of dividing by them
- Using cross product instead of dot product
- Sorting ascending instead of descending
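For larger collections, the per-document loop can be replaced by one matrix operation. The following is a sketch of that alternative, not the quiz's answer: the function name top_k and the choice to stack the documents into a 2-D array are assumptions made for illustration.

```python
import numpy as np

def top_k(query, doc_matrix, k=2):
    """Return (indices, scores) of the k rows most similar to query."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity of the query against every row at once.
    scores = doc_matrix @ query / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query)
    )
    # A stable sort on the negated scores gives descending order,
    # with ties broken in favor of the lower index.
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist(), scores[order].tolist()

docs = np.array([[1, 0], [0, 1], [1, 1]])
query = np.array([1, 0])

indices, scores = top_k(query, docs, k=2)
print(indices)  # [0, 2]
```

The result matches the loop-based answer: document 0 is identical in direction to the query (score 1.0), and document 2 is next (score about 0.71).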
