This pipeline finds items similar to a query by comparing their features. It helps retrieve the closest matches from a large collection quickly.
Similarity search and retrieval in Prompt Engineering / GenAI - Model Pipeline Trace
Model Pipeline - Similarity search and retrieval
Data Flow - 5 Stages
1. Data in
10000 items x 300 features → Raw feature vectors representing items → 10000 items x 300 features
↓
2. Preprocessing
10000 items x 300 features → Normalize vectors to unit length → 10000 items x 300 features
↓
3. Feature Engineering
1 query item x 300 features → Normalize query vector to unit length → 1 query item x 300 features
↓
4. Similarity Computation
Query vector (1 x 300) and dataset (10000 x 300) → Compute cosine similarity between query and all items → 10000 similarity scores
↓
5. Retrieval
10000 similarity scores → Sort scores and select top 5 items → 5 items with highest similarity
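The five stages above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the pipeline's actual implementation: the helper names `normalize_rows` and `top_k_similar` are hypothetical, and the data is random, standing in for real feature vectors.

```python
import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def top_k_similar(items: np.ndarray, query: np.ndarray, k: int = 5):
    """Return indices and cosine scores of the k items closest to the query."""
    items_unit = normalize_rows(items)                     # stage 2: preprocessing
    query_unit = normalize_rows(query.reshape(1, -1))[0]   # stage 3: normalize query
    scores = items_unit @ query_unit                       # stage 4: one score per item
    top = np.argsort(scores)[::-1][:k]                     # stage 5: sort, keep top k
    return top, scores[top]


# Stage 1: synthetic stand-in for 10000 items x 300 features (assumption).
rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 300))

# A scaled copy of item 42 should rank item 42 first with a score close to 1.0,
# because cosine similarity ignores vector length.
query = items[42] * 3.0
idx, sims = top_k_similar(items, query, k=5)
print(idx, sims)
```

Because both sides are normalized first, the matrix-vector product in stage 4 already yields cosine similarities, so no separate division by norms is needed.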
Training Trace - Epoch by Epoch
[Plot: loss (y-axis, 0.1 to 0.5) versus epoch (x-axis, 1 to 5)]
| Epoch | Loss ↓ | Accuracy ↑ | Observation |
|---|---|---|---|
| 1 | 0.45 | 0.60 | Initial training with random embeddings |
| 2 | 0.35 | 0.72 | Loss decreased, accuracy improved |
| 3 | 0.28 | 0.80 | Model learning meaningful features |
| 4 | 0.22 | 0.85 | Good convergence, stable improvement |
| 5 | 0.18 | 0.89 | Final epoch with strong similarity predictions |
Prediction Trace - 4 Layers
Layer 1: Input query vector
Layer 2: Normalize query vector
Layer 3: Compute cosine similarity
Layer 4: Sort and select top matches
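Layers 2 and 3 fit together because, once both vectors have unit length, cosine similarity reduces to a plain dot product. The small check below illustrates this with made-up vectors; the helper names `cosine` and `unit` are hypothetical.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def unit(u: np.ndarray) -> np.ndarray:
    """Rescale a vector to unit L2 length."""
    return u / np.linalg.norm(u)


a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])    # same direction as a, twice as long
c = np.array([4.0, -3.0])   # perpendicular to a

raw_dot_ab = float(np.dot(a, b))               # 50.0, grows with vector length
cos_ab = cosine(a, b)                          # approximately 1.0
unit_dot_ab = float(np.dot(unit(a), unit(b)))  # matches cos_ab
cos_ac = cosine(a, c)                          # 0.0 for perpendicular vectors

print(raw_dot_ab, cos_ab, unit_dot_ab, cos_ac)
```

The raw dot product changes when a vector is rescaled, while the normalized score depends only on direction, which is what a similarity ranking usually needs.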
Model Quiz - 3 Questions
Test your understanding
What does normalizing vectors before similarity calculation help with?
Practice
1.
What is the main goal of similarity search in machine learning?
easy
Solution
Step 1: Understand the purpose of similarity search
Similarity search is used to find items that are similar or close to each other in a dataset.
Step 2: Compare options with the definition
Only "To find items that are close or alike in a collection" describes finding similar items, which matches the goal of similarity search.
Final Answer:
To find items that are close or alike in a collection -> Option C
Quick Check:
Similarity search = find similar items [OK]
Hint: Similarity search finds close or alike items [OK]
Common Mistakes:
- Confusing similarity search with sorting
- Thinking similarity search counts items
- Assuming it removes duplicates
2.
Which of the following is the correct way to compute cosine similarity between two vectors A and B in Python using numpy?
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
# What code computes cosine similarity?
easy
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = dot product of A and B divided by the product of their norms.
Step 2: Match formula to code options
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) matches the formula exactly: the dot product is the numerator, and the parentheses group both norms in the denominator.
Final Answer:
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option D
Quick Check:
Cosine similarity = dot / (norm A * norm B) [OK]
Hint: Cosine similarity = dot product divided by norms product [OK]
Common Mistakes:
- Adding norms instead of multiplying
- Subtracting norms in denominator
- Multiplying dot product by sum of norms
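As a worked check of the formula, the sketch below evaluates each piece for the vectors A and B from the question. The intermediate variable names are illustrative only.

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

dot = np.dot(A, B)            # 1*4 + 2*5 + 3*6 = 32
norm_a = np.linalg.norm(A)    # sqrt(1 + 4 + 9)    = sqrt(14)
norm_b = np.linalg.norm(B)    # sqrt(16 + 25 + 36) = sqrt(77)

cos_sim = dot / (norm_a * norm_b)
print(round(float(cos_sim), 4))  # 0.9746
```

The value is close to 1 because A and B point in nearly the same direction, even though B is much longer than A.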
3.
Given the following vectors, what is the cosine similarity between vec1 and vec2?
import numpy as np
vec1 = np.array([1, 0, 0])
vec2 = np.array([0, 1, 0])
cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print("{:.2f}".format(cos_sim))
medium
Solution
Step 1: Calculate dot product of vec1 and vec2
Dot product = 1*0 + 0*1 + 0*0 = 0.
Step 2: Calculate norms and cosine similarity
Norm of vec1 = 1, norm of vec2 = 1, so cosine similarity = 0 / (1*1) = 0.
Final Answer:
0.00 -> Option A
Quick Check:
Orthogonal vectors have cosine similarity 0 [OK]
Hint: Orthogonal vectors have cosine similarity zero [OK]
Common Mistakes:
- Confusing dot product with cosine similarity
- Forgetting to divide by norms
- Rounding errors causing wrong answer
4.
Consider this code snippet for similarity search. What is the error?
import numpy as np
vectors = [np.array([1, 2]), np.array([3, 4])]
query = np.array([1, 0])
scores = []
for v in vectors:
    score = np.dot(query, v) / np.linalg.norm(query) * np.linalg.norm(v)
    scores.append(score)
print(scores)
medium
Solution
Step 1: Analyze the cosine similarity formula in code
The formula should divide the dot product by the product of the norms: dot(query, v) / (norm(query) * norm(v)).
Step 2: Identify missing parentheses
The code does np.dot(query, v) / np.linalg.norm(query) * np.linalg.norm(v). Division and multiplication have equal precedence and are evaluated left to right, so this computes (dot / norm(query)) * norm(v), which multiplies by norm(v) instead of dividing by it.
Final Answer:
Missing parentheses causing wrong order of operations -> Option A
Quick Check:
Use parentheses to group denominator multiplication [OK]
Hint: Use parentheses to group denominator in cosine similarity [OK]
Common Mistakes:
- Forgetting parentheses around denominator
- Using cross product instead of dot product
- Ignoring vector length mismatch
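Running both versions side by side makes the effect of the missing parentheses visible. The sketch below reuses the vectors from the question; the list names `buggy` and `fixed` are illustrative.

```python
import numpy as np

vectors = [np.array([1, 2]), np.array([3, 4])]
query = np.array([1, 0])

buggy, fixed = [], []
for v in vectors:
    dot = np.dot(query, v)
    # Evaluated left to right: (dot / ||query||) * ||v||
    buggy.append(dot / np.linalg.norm(query) * np.linalg.norm(v))
    # Parentheses group the whole denominator: dot / (||query|| * ||v||)
    fixed.append(dot / (np.linalg.norm(query) * np.linalg.norm(v)))

print([round(float(s), 4) for s in buggy])  # [2.2361, 15.0]
print([round(float(s), 4) for s in fixed])  # [0.4472, 0.6]
```

Cosine similarity always lies between -1 and 1, so scores such as 2.2361 or 15.0 are a quick signal that the denominator was not grouped correctly.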
5.
You have a collection of text documents converted into vectors. You want to find the top 2 most similar documents to a new query vector using cosine similarity. Which approach is best?
- Compute cosine similarity between query and each document vector.
- Sort documents by similarity score descending.
- Return top 2 documents.
Which code snippet correctly implements this?
import numpy as np
docs = [np.array([1, 0]), np.array([0, 1]), np.array([1, 1])]
query = np.array([1, 0])
# Choose the correct code:
hard
Solution
Step 1: Compute cosine similarity correctly
The correct snippet is:
scores = [np.dot(query, d) / (np.linalg.norm(query) * np.linalg.norm(d)) for d in docs]
top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(top2)
It computes cosine similarity as the dot product divided by the product of the norms, which is correct.
Step 2: Sort indices by similarity descending and select top 2
sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2] orders the document indices from highest to lowest score and keeps the first 2, matching the requirement.
Final Answer:
The snippet above -> Option B
Quick Check:
Cosine similarity + sort descending + top 2 = the snippet above [OK]
Hint: Compute cosine similarity, sort descending, pick top results [OK]
Common Mistakes:
- Multiplying by the norms instead of dividing by their product
- Using cross product instead of dot product
- Sorting ascending instead of descending
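For the three documents in this question, the same steps can also be written in vectorized form. This is an alternative sketch, not the quiz's answer option: it assumes the documents are stacked into a single 2-D array, and it uses `np.argsort` on the negated scores to get a descending order.

```python
import numpy as np

docs = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
query = np.array([1, 0], dtype=float)

# One cosine score per document: dot products divided by the product of norms.
scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Negating the scores makes argsort return the highest-scoring indices first.
top2 = np.argsort(-scores)[:2]

print(scores.round(4).tolist())  # [1.0, 0.0, 0.7071]
print(top2.tolist())             # [0, 2]
```

Document 0 is identical in direction to the query, document 2 shares one of its two components, and document 1 is orthogonal, so indices 0 and 2 are returned.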
