TF-IDF is a common technique used in information retrieval. What does it primarily help with?
Think about how TF-IDF balances word frequency with rarity across documents.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It scores a word higher if it appears often in a document but rarely across the whole collection, helping to surface words that are distinctive to that document. A common form of the weight is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t.
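The idea above can be sketched in a few lines. This is a minimal raw tf × idf weighting over a toy corpus; real libraries such as scikit-learn add smoothing and normalization on top, so treat it as an illustration rather than a production scorer.

```python
import math

def tf_idf(term, doc, corpus):
    # Term frequency: share of the document occupied by this term
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term at all
    df = sum(1 for d in corpus if term in d)
    # Rare terms get a large idf; ubiquitous terms get idf = 0
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "fast"],
]
# "the" appears in every document, so its score is 0
print(tf_idf("the", corpus[0], corpus))                # 0.0
# "cat" is frequent in doc 0 but rarer in the corpus, so it scores higher
print(round(tf_idf("cat", corpus[0], corpus), 3))      # 0.135
```

Note how the idf factor is what pushes the score of a stop word like "the" to zero even though its term frequency is as high as "cat"'s.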
Given two vectors representing documents, what is the cosine similarity output?
import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the vector magnitudes
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

vec1 = np.array([1, 2, 3])
vec2 = np.array([4, 5, 6])
result = cosine_similarity(vec1, vec2)
print(round(result, 2))
Recall the cosine similarity formula, then compute the dot product and the two vector norms carefully.
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes. Here the dot product is 1·4 + 2·5 + 3·6 = 32, and the magnitudes are √14 ≈ 3.742 and √77 ≈ 8.775, so the similarity is 32 / 32.83 ≈ 0.9746, which rounds to 0.97.
You want to build a search system that understands the meaning of queries and documents beyond exact word matches. Which model would you choose?
Consider models that capture context and meaning, not just word counts.
Transformer-based models like BERT produce embeddings that capture semantic meaning, enabling better understanding of queries and documents for semantic search.
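A toy illustration of why embeddings help: below, the word vectors are hand-made for demonstration only (a real system would use learned embeddings from a model such as BERT or sentence-transformers), but they show how semantically related words can match even with zero lexical overlap.

```python
import numpy as np

# Hand-made 3-d "embeddings" for illustration; real embeddings are learned
# and typically have hundreds of dimensions.
embeddings = {
    "car":        np.array([0.90, 0.10, 0.00]),
    "automobile": np.array([0.88, 0.12, 0.02]),
    "banana":     np.array([0.00, 0.20, 0.95]),
}

def doc_vector(words):
    # Average the word vectors to get a single document embedding
    return np.mean([embeddings[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = doc_vector(["car"])
# "automobile" shares no characters with "car", yet its embedding is close,
# so semantic search ranks it above the unrelated "banana"
print(cosine(query, doc_vector(["automobile"])) >
      cosine(query, doc_vector(["banana"])))  # True
```

An exact-match system like plain TF-IDF would score "automobile" at zero for the query "car"; the embedding comparison does not.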
In a k-Nearest Neighbors (k-NN) model used for retrieving similar documents, which hyperparameter controls how many neighbors are checked?
Think about the 'k' in k-NN and what it stands for.
The hyperparameter 'k' specifies how many nearest neighbors the model considers when making a retrieval or classification decision.
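The role of k can be sketched directly. This is a minimal brute-force retrieval pass (a production system would typically use an approximate index such as those in scikit-learn or FAISS); the document vectors here are made up for illustration.

```python
import numpy as np

def knn_retrieve(query, docs, k):
    # Euclidean distance from the query to every document vector
    dists = np.linalg.norm(docs - query, axis=1)
    # Indices of the k closest documents, nearest first;
    # k controls how many neighbors are returned
    return np.argsort(dists)[:k]

docs = np.array([[0.0, 0.0],
                 [1.0, 1.0],
                 [0.1, 0.2],
                 [5.0, 5.0]])
query = np.array([0.0, 0.1])
print(knn_retrieve(query, docs, k=2))  # [0 2]
```

Raising k widens the candidate set (smoother, slower), while k = 1 returns only the single closest document.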
You want to measure how well your search engine ranks relevant documents higher than irrelevant ones. Which metric is most appropriate?
Consider metrics that account for both relevance and position in the ranked list.
NDCG (Normalized Discounted Cumulative Gain) measures ranking quality by combining the graded relevance of documents with their positions, rewarding rankings that place relevant documents higher in the list.
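The position-discounting idea can be made concrete with the standard DCG formula, DCG = Σ rel_i / log2(i + 1), normalized by the ideal ordering. The relevance grades below are made up for illustration.

```python
import math

def dcg(relevances):
    # Positions are 1-indexed; log2(i + 1) discounts gains at lower ranks
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the best possible ordering of the same docs
    ideal = sorted(ranked_relevances, reverse=True)
    return dcg(ranked_relevances) / dcg(ideal)

# Graded relevance of documents in the order the engine returned them
good  = [3, 2, 0, 1]  # highly relevant doc ranked first
worse = [0, 1, 2, 3]  # same documents, relevant ones pushed down
print(round(ndcg(good), 3))   # 0.985
print(round(ndcg(worse), 3))  # 0.614
```

Both rankings contain exactly the same documents; only their positions differ, which is what separates NDCG from position-blind metrics like plain precision.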