Imagine you have two documents represented as vectors. What does cosine similarity tell you about these documents?
Think about how vectors relate to each other in space.
Cosine similarity measures the cosine of the angle between two vectors, ranging from -1 to 1. A smaller angle (cosine closer to 1) means the documents point in a similar direction in vector space and are therefore more similar in content; orthogonal vectors (cosine 0) share no directional similarity.
What is the output of this Python code that calculates cosine similarity between two document vectors?
```python
import numpy as np

doc1 = np.array([1, 2, 3])
doc2 = np.array([4, 5, 6])
cos_sim = np.dot(doc1, doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(cos_sim, 2))
```
Calculate dot product and norms carefully.
The dot product is 1·4 + 2·5 + 3·6 = 32, and the norms are √14 and √77, so the cosine similarity between [1,2,3] and [4,5,6] is 32 / (√14 · √77) ≈ 0.9746, which rounds to 0.97.
You want to rank documents by meaning, not just word overlap. Which model is best for this?
Consider models that understand context and meaning.
Pretrained transformer embeddings capture semantic meaning better than simple count or TF-IDF models, because they encode context: synonyms and paraphrases land close together in the embedding space even when they share no words with the query.
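Once each document and the query have been encoded as embedding vectors, ranking reduces to the cosine-similarity calculation shown earlier. A minimal sketch, using small hypothetical embedding values for illustration (in practice these would come from a pretrained transformer encoder):

```python
import numpy as np

# Hypothetical precomputed embeddings for illustration only; a real system
# would obtain these from a pretrained transformer sentence encoder.
query_emb = np.array([0.9, 0.1, 0.3])
doc_embs = np.array([
    [0.8, 0.2, 0.4],   # doc 0: close in meaning to the query
    [-0.5, 0.9, 0.1],  # doc 1: unrelated
    [0.7, 0.0, 0.5],   # doc 2: also related
])

# Cosine similarity of the query against every document at once.
sims = doc_embs @ query_emb / (
    np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
)

# Document indices ordered from most to least similar.
ranking = np.argsort(-sims)
print(ranking)  # -> [0 2 1]
```

Negating the scores before `np.argsort` gives a descending order, so the most semantically similar document comes first.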
How does increasing the embedding vector size affect document similarity ranking?
Think about trade-offs between detail and complexity.
Larger embeddings can capture finer-grained distinctions between documents, but they also increase storage and the per-comparison cost of similarity calculations, and with limited training data the extra dimensions can overfit rather than improve ranking quality.
You have a list of documents ranked by similarity to a query. Which metric best measures how well the ranking matches user relevance?
Consider metrics that evaluate ranked lists and relevance.
Precision at K measures the fraction of the top K ranked documents that are relevant to the user, which makes it a natural fit for evaluating ranked retrieval lists.
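The metric is simple to compute from a ranked list and a set of relevance judgments. A minimal sketch, using a made-up ranking and relevance set for illustration:

```python
# Hypothetical ranking and relevance judgments for illustration.
ranked_doc_ids = [3, 1, 4, 0, 2]   # documents ordered by similarity to the query
relevant_ids = {1, 2, 3}           # documents the user judged relevant

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k

print(round(precision_at_k(ranked_doc_ids, relevant_ids, 3), 2))  # -> 0.67
```

Here two of the top three documents (ids 3 and 1) are relevant, giving 2/3 ≈ 0.67. Note that Precision at K ignores the ordering within the top K; metrics such as NDCG additionally reward placing relevant documents earlier.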