Recall & Review
beginner
What is similarity search in machine learning?
Similarity search is a method for finding the items most similar to a given item, based on some measure of closeness or resemblance.
beginner
Name a common way to measure similarity between two data points.
Cosine similarity is a common measure: it computes the cosine of the angle between two vectors to determine how similar their directions are.
beginner
Why is vector representation important in similarity search?
Vector representation converts data into numbers so computers can measure similarity using math, like distances or angles between vectors.
intermediate
What is the role of an index in similarity search and retrieval?
An index organizes data vectors so the system can quickly find the most similar items without checking every single one.
intermediate
Explain the difference between exact and approximate similarity search.
Exact search returns the true closest matches but can be slow on large datasets. Approximate search returns close matches faster but may miss the very best ones.
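The trade-off can be sketched in a few lines of NumPy. This is a minimal illustration, not a production index: the dataset is random, the names (data, query, planes, bucket_key, index) are hypothetical, and the approximate side uses a single table of random-hyperplane hashes, which is one simple form of locality-sensitive hashing swapped in here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))   # hypothetical dataset: 1000 vectors, 16 dimensions
query = rng.normal(size=16)

def cosine_scores(matrix, q):
    # Cosine similarity between q and every row of matrix.
    return matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))

# Exact search: score every vector and keep the best one.
all_scores = cosine_scores(data, query)
exact_best = int(np.argmax(all_scores))

# Approximate search (assumed scheme): bucket vectors by the signs of a few
# random projections, then score only the vectors in the query's bucket.
planes = rng.normal(size=(4, 16))    # 4 hyperplanes give up to 16 buckets

def bucket_key(v):
    return tuple((planes @ v > 0).tolist())

index = {}
for i, v in enumerate(data):         # built once, ahead of query time
    index.setdefault(bucket_key(v), []).append(i)

candidates = index.get(bucket_key(query), [])
approx_best = None
if candidates:
    cand_scores = cosine_scores(data[candidates], query)
    approx_best = candidates[int(np.argmax(cand_scores))]

print(exact_best, approx_best, len(candidates))
```

The approximate search scores only the candidates in one bucket, so it does less work per query, but the true best match may sit in a different bucket and be missed.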
Which similarity measure calculates the angle between two vectors?
A. Manhattan distance
B. Euclidean distance
C. Jaccard index
D. Cosine similarity
Cosine similarity measures the cosine of the angle between two vectors, indicating how similar their directions are.
What is the main purpose of an index in similarity search?
A. To train machine learning models
B. To store raw data
C. To speed up finding similar items
D. To visualize data
An index helps quickly locate similar items without scanning all data, improving search speed.
Which of these is a drawback of exact similarity search?
A. It cannot handle vectors
B. It can be slow on large datasets
C. It uses approximate results
D. It is inaccurate
Exact search checks all data points, which can be slow when the dataset is very large.
Vector representation is important because:
A. It allows mathematical comparison of data
B. It stores data as text
C. It removes the need for similarity measures
D. It visualizes data
Vectors let computers use math to compare data points for similarity.
Which similarity measure is best for comparing sets of items?
A. Jaccard index
B. Cosine similarity
C. Euclidean distance
D. Pearson correlation
Jaccard index measures similarity between sets by comparing shared and total items.
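As a small sketch of that definition, the Jaccard index is the size of the intersection divided by the size of the union. The function name jaccard, the example sets, and the choice to return 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

# Two hypothetical shopping baskets sharing 2 of 4 distinct items.
basket_1 = {"apple", "bread", "milk"}
basket_2 = {"bread", "milk", "eggs"}
print(jaccard(basket_1, basket_2))  # 0.5
```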
Describe how similarity search works and why it is useful in real life.
Think about how online stores suggest products you might like.
Explain the difference between exact and approximate similarity search and when you might use each.
Consider searching a huge photo collection quickly.
Practice
1.
What is the main goal of similarity search in machine learning?
easy
A. To count the number of items in a dataset
B. To sort items alphabetically
C. To find items that are close or alike in a collection
D. To remove duplicate items from a list
Solution
Step 1: Understand the purpose of similarity search
Similarity search is used to find items that are similar or close to each other in a dataset.
Step 2: Compare options with the definition
Only option C, "To find items that are close or alike in a collection", describes finding similar items, which matches the goal of similarity search.
Final Answer:
To find items that are close or alike in a collection -> Option C
Quick Check:
Similarity search = find similar items [OK]
Hint: Similarity search finds close or alike items [OK]
Common Mistakes:
Confusing similarity search with sorting
Thinking similarity search counts items
Assuming it removes duplicates
2.
Which of the following is the correct way to compute cosine similarity between two vectors A and B in Python using numpy?
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
# What code computes cosine similarity?
easy
A. np.dot(A, B) * (np.linalg.norm(A) + np.linalg.norm(B))
B. np.dot(A, B) / (np.linalg.norm(A) - np.linalg.norm(B))
C. np.sum(A * B) / (np.linalg.norm(A) - np.linalg.norm(B))
D. np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = dot product of A and B divided by product of their norms.
Step 2: Match formula to code options
Option D, np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)), matches the formula exactly. The other options add or subtract the norms instead of multiplying them.
Final Answer:
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option D
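Running the chosen expression on the vectors from the question gives a concrete value. The variable name cos_sim is introduced here only for illustration.

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

# Dot product divided by the product of the two Euclidean norms.
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(float(cos_sim), 4))  # 0.9746
```

The dot product is 32 and the norms are sqrt(14) and sqrt(77), so the result is close to 1: the two vectors point in nearly the same direction.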
Norm of vec1 = 1, norm of vec2 = 1, so cosine similarity = 0 / (1*1) = 0.
Final Answer:
0.00 -> Option A
Quick Check:
Orthogonal vectors have cosine similarity 0 [OK]
Hint: Orthogonal vectors have cosine similarity zero [OK]
Common Mistakes:
Confusing dot product with cosine similarity
Forgetting to divide by norms
Rounding errors causing wrong answer
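The orthogonal case can be checked directly. This sketch assumes two unit vectors along different axes; the names vec1 and vec2 follow the solution text, but the values used here are illustrative.

```python
import numpy as np

# Assumed example values: two perpendicular unit vectors.
vec1 = np.array([1, 0])
vec2 = np.array([0, 1])

# The dot product is 0 and both norms are 1, so the similarity is 0 / (1 * 1).
cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(f"{cos_sim:.2f}")  # 0.00
```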
4.
Consider this code snippet for similarity search. What is the error?
import numpy as np
vectors = [np.array([1, 2]), np.array([3, 4])]
query = np.array([1, 0])
scores = []
for v in vectors:
    score = np.dot(query, v) / np.linalg.norm(query) * np.linalg.norm(v)
    scores.append(score)
print(scores)
medium
A. Missing parentheses causing wrong order of operations
B. Using np.dot instead of np.cross
C. Vectors have different lengths
D. Query vector is not normalized
Solution
Step 1: Analyze the cosine similarity formula in code
The formula should divide dot product by product of norms: dot(query, v) / (norm(query) * norm(v)).
Step 2: Identify missing parentheses
The code evaluates np.dot(query, v) / np.linalg.norm(query) * np.linalg.norm(v) from left to right, so it divides by norm(query) and then multiplies by norm(v) instead of dividing by both norms. The result is wrong and can exceed 1.
Final Answer:
Missing parentheses causing wrong order of operations -> Option A
Quick Check:
Use parentheses to group denominator multiplication [OK]
Hint: Use parentheses to group denominator in cosine similarity [OK]
Common Mistakes:
Forgetting parentheses around denominator
Using cross product instead of dot product
Ignoring vector length mismatch
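To see the effect of the missing parentheses, the sketch below computes both versions side by side on the vectors from the question. The list names buggy and fixed are introduced here for illustration.

```python
import numpy as np

vectors = [np.array([1, 2]), np.array([3, 4])]
query = np.array([1, 0])

buggy, fixed = [], []
for v in vectors:
    dot = np.dot(query, v)
    # Original expression: divides by norm(query), then multiplies by norm(v).
    buggy.append(dot / np.linalg.norm(query) * np.linalg.norm(v))
    # Corrected expression: the whole denominator is grouped in parentheses.
    fixed.append(dot / (np.linalg.norm(query) * np.linalg.norm(v)))

print(buggy)  # values above 1, so not valid cosine similarities
print(fixed)  # values within [-1, 1]
```

For the second vector, [3, 4], the buggy version yields 3 / 1 * 5 = 15, while the corrected version yields 3 / (1 * 5) = 0.6.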
5.
You have a collection of text documents converted into vectors. You want to find the top 2 most similar documents to a new query vector using cosine similarity. Which approach is best?
Compute cosine similarity between query and each document vector.
A. scores = [np.dot(query, d) * np.linalg.norm(query) * np.linalg.norm(d) for d in docs]
top2 = sorted(scores)[:2]
print(top2)
B. scores = [np.dot(query, d) / (np.linalg.norm(query) * np.linalg.norm(d)) for d in docs]
top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(top2)
C. scores = [np.dot(query, d) / (np.linalg.norm(query) - np.linalg.norm(d)) for d in docs]
top2 = sorted(range(len(scores)), key=lambda i: scores[i])[:2]
print(top2)
D. scores = [np.cross(query, d) / (np.linalg.norm(query) * np.linalg.norm(d)) for d in docs]
top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(top2)
Solution
Step 1: Compute cosine similarity correctly
Option B computes cosine similarity as the dot product divided by the product of the norms, which is correct. Option A multiplies by the norms, option C subtracts them, and option D uses np.cross instead of np.dot.
Step 2: Sort indices by similarity descending and select top 2
Option B sorts the document indices by score in descending order (reverse=True) and keeps the first two, which matches the requirement.
Final Answer:
scores = [np.dot(query, d) / (np.linalg.norm(query) * np.linalg.norm(d)) for d in docs]
top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(top2) -> Option B
Quick Check:
Cosine similarity + sort descending + top 2 = Option B [OK]
Hint: Compute cosine similarity, sort descending, pick top results [OK]
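The same top-2 retrieval can also be written in vectorized form. This is a sketch under assumptions: docs is stored as a 2-D NumPy array with one document per row, and the document and query values are made up for illustration.

```python
import numpy as np

# Hypothetical document vectors (one per row) and query vector.
docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 0.1],
])
query = np.array([1.0, 0.0])

# Cosine similarity of the query against every row at once.
scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Indices of the two highest scores, best first.
top2 = np.argsort(scores)[::-1][:2]
print(top2.tolist())  # [0, 3]
```

Sorting all scores is fine for small collections. For large ones, np.argpartition can select the top k without a full sort, at the cost of ordering the selected items separately.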