
Document similarity ranking in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: 🧠 Conceptual (intermediate)
What does cosine similarity measure in document ranking?

Imagine you have two documents represented as vectors. What does cosine similarity tell you about these documents?

A) The cosine of the angle between the two document vectors, indicating how similar their content is.
B) The total number of common words between the two documents.
C) The difference in length between the two documents, measured by word count.
D) The sum of the frequencies of all words in both documents.
💡 Hint

Think about how vectors relate to each other in space.
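The geometric intuition behind the hint can be made concrete with a minimal NumPy sketch (the vectors and the helper name are illustrative): two vectors at right angles score 0, while two vectors pointing in the same direction score 1 no matter how their lengths differ.

```python
import numpy as np

# Minimal sketch (illustrative vectors): cosine similarity is the cosine
# of the angle between two vectors and ignores their magnitudes.
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])  # perpendicular to a: angle of 90 degrees
c = np.array([3.0, 0.0])  # same direction as a, three times longer

print(cosine_similarity(a, b))  # 0.0 -> orthogonal directions
print(cosine_similarity(a, c))  # 1.0 -> identical direction, length ignored
```

Because vector length cancels out of the formula, a long document and a short document with the same word proportions are treated as identical in direction.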

Problem 2: Predict Output (intermediate)
Output of cosine similarity calculation

What is the output of this Python code that calculates cosine similarity between two document vectors?

Python
import numpy as np

doc1 = np.array([1, 2, 3])
doc2 = np.array([4, 5, 6])

cos_sim = np.dot(doc1, doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(cos_sim, 2))
A) 0.75
B) 1.00
C) 0.97
D) 0.87
💡 Hint

Calculate dot product and norms carefully.
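To check a prediction after answering, the computation can be broken into its parts (same arrays as in the problem):

```python
import numpy as np

doc1 = np.array([1, 2, 3])
doc2 = np.array([4, 5, 6])

dot = np.dot(doc1, doc2)      # 1*4 + 2*5 + 3*6 = 32
n1 = np.linalg.norm(doc1)     # sqrt(1 + 4 + 9)   = sqrt(14)
n2 = np.linalg.norm(doc2)     # sqrt(16 + 25 + 36) = sqrt(77)

# 32 / (sqrt(14) * sqrt(77)) = 32 / 32.8329... ≈ 0.9746
print(round(dot / (n1 * n2), 2))  # 0.97
```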

Problem 3: Model Choice (advanced)
Best model for semantic document similarity

You want to rank documents by meaning, not just word overlap. Which model is best for this?

A) Bag-of-words model with Euclidean distance
B) Count vectorizer with Jaccard similarity
C) TF-IDF vectorizer with cosine similarity
D) Pretrained transformer embeddings with cosine similarity
💡 Hint

Consider models that understand context and meaning.
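Why word-overlap methods fall short of semantic ranking can be seen in a small sketch (pure NumPy, with a hypothetical toy vocabulary): a paraphrase written with different vocabulary scores low under bag-of-words cosine similarity even though its meaning matches the query.

```python
import numpy as np

# Sketch with a hypothetical toy vocabulary: bag-of-words cosine
# similarity only sees shared surface words, so a paraphrase that uses
# different words scores low despite having the same meaning.
def bow_vector(text, vocab):
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "the cat sat on the mat"
paraphrase = "a feline rested on a rug"  # same meaning, different words

vocab = sorted(set(query.split()) | set(paraphrase.split()))
q = bow_vector(query, vocab)
p = bow_vector(paraphrase, vocab)

print(round(cosine(q, p), 2))  # low score despite matching meaning
```

A pretrained transformer embedding model, by contrast, maps both sentences to nearby points in vector space, so their cosine similarity comes out high; that is what makes such embeddings the better choice for ranking by meaning.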

Problem 4: Hyperparameter (advanced)
Effect of embedding dimension on similarity ranking

How does increasing the embedding vector size affect document similarity ranking?

A) It can improve accuracy but may cause overfitting or slow computation.
B) It reduces accuracy because larger vectors are harder to compare.
C) It has no effect on similarity ranking performance.
D) It always improves ranking accuracy by capturing more details.
💡 Hint

Think about trade-offs between detail and complexity.
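The cost side of the trade-off is easy to demonstrate: with a fixed number of documents, both the memory for the index and the multiply-adds per query scale linearly with the embedding dimension. In this sketch, random vectors stand in for real embeddings.

```python
import numpy as np

# Sketch: with n documents and embedding dimension d, scoring one query
# costs n * d multiply-adds, and the index holds n * d floats.
# Random vectors stand in for real embeddings here.
rng = np.random.default_rng(0)
n_docs = 1000

for dim in (64, 768):
    docs = rng.standard_normal((n_docs, dim))  # document embeddings
    query = rng.standard_normal(dim)           # query embedding
    scores = docs @ query                      # one dot product per document
    print(f"dim={dim}: {docs.nbytes} bytes for the document matrix")
```

Going from 64 to 768 dimensions multiplies both memory and per-query work by 12, which is why larger embeddings are only worthwhile when they actually improve ranking quality.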

Problem 5: Metrics (expert)
Choosing the best metric for ranking evaluation

You have a list of documents ranked by similarity to a query. Which metric best measures how well the ranking matches user relevance?

A) Mean Squared Error (MSE)
B) Precision at K (P@K)
C) Root Mean Squared Logarithmic Error (RMSLE)
D) Confusion Matrix
💡 Hint

Consider metrics that evaluate ranked lists and relevance.
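For reference, Precision at K is simple to implement: it is the fraction of the top-K ranked documents that are actually relevant. This sketch uses hypothetical document IDs and relevance labels.

```python
# Sketch of Precision at K with hypothetical document IDs and a
# hypothetical ground-truth relevance set.
def precision_at_k(ranked_ids, relevant_ids, k):
    top_k = ranked_ids[:k]
    hits = sum(1 for doc in top_k if doc in relevant_ids)
    return hits / k

ranked = ["d3", "d1", "d7", "d2", "d5"]  # system ranking, best first
relevant = {"d1", "d2", "d9"}            # ground-truth relevant documents

print(precision_at_k(ranked, relevant, 3))  # 1 relevant doc in the top 3
```

Unlike MSE or a confusion matrix, P@K evaluates the ordered list itself, which is what matters when users only look at the first few results.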