Recall & Review
beginner
What is document similarity ranking in simple terms?
Document similarity ranking is a way to find and order documents based on how alike they are to a given document or query. It helps show the most relevant documents first, like sorting your photos by how similar they look.
beginner
Name a common method to represent documents for similarity comparison.
A common method is to turn documents into vectors using techniques like TF-IDF or word embeddings. These vectors are like points in space that capture the meaning or important words of the documents.
intermediate
How does cosine similarity help in document similarity ranking?
Cosine similarity is the cosine of the angle between two document vectors. The smaller the angle, the closer the score is to 1 and the more similar the documents. Because it depends only on direction, it ranks documents by how closely their content matches while ignoring differences in length.
intermediate
What role does TF-IDF play in document similarity?
TF-IDF scores words by how important they are in a document compared to all documents. It helps highlight unique words, making similarity ranking focus on meaningful content rather than common words like 'the' or 'and'.
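As a rough illustration of the weighting, the sketch below fits scikit-learn's TfidfVectorizer on a made-up three-sentence corpus (the sentences are assumptions chosen for the example) and prints the IDF weight learned for each word. A word that appears in every document should receive the lowest weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'the' occurs in every document, other words in one each.
docs = ['the cat sat', 'the dog barked', 'the bird sang']

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Map each vocabulary term to its learned inverse document frequency.
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

# Print terms from lowest to highest IDF; the common word comes first.
for term in sorted(idf, key=idf.get):
    print(f"{term:8s} {idf[term]:.2f}")
```

Here 'the' gets the smallest weight because it carries no information about which document it came from.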
advanced
Why might word embeddings improve document similarity ranking over simple word counts?
Word embeddings capture the meaning and context of words, so documents with similar ideas but different words can still be ranked as similar. Simple counts miss this meaning and only see exact word matches.
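The idea can be sketched with hand-made vectors. The 3-dimensional "embeddings" below are invented for illustration only (real embeddings come from a trained model and have hundreds of dimensions), and averaging word vectors is just one simple way to get a document vector.

```python
import numpy as np

# Hypothetical toy embeddings: near-synonyms are given nearby vectors.
emb = {
    'car':        np.array([0.90, 0.10, 0.00]),
    'automobile': np.array([0.80, 0.20, 0.10]),
    'fast':       np.array([0.10, 0.90, 0.00]),
    'quick':      np.array([0.15, 0.85, 0.00]),
    'ripe':       np.array([0.00, 0.10, 0.90]),
    'banana':     np.array([0.00, 0.00, 1.00]),
}

def doc_vector(text):
    # Represent a document as the mean of its word vectors.
    return np.mean([emb[w] for w in text.split()], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

doc_a, doc_b, doc_c = 'fast car', 'quick automobile', 'ripe banana'

# No words are shared between doc_a and doc_b, so count-based similarity would be 0.
shared = set(doc_a.split()) & set(doc_b.split())

sim_ab = cosine(doc_vector(doc_a), doc_vector(doc_b))
sim_ac = cosine(doc_vector(doc_a), doc_vector(doc_c))
print(shared, round(sim_ab, 2), round(sim_ac, 2))
```

With these toy vectors, 'fast car' and 'quick automobile' score as highly similar despite sharing no words, while 'ripe banana' scores near zero.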
Which technique converts documents into vectors for similarity comparison?
A. TF-IDF
B. HTML parsing
C. Image filtering
D. Sorting algorithms
TF-IDF is a common method to convert documents into numerical vectors for similarity calculations.
What does cosine similarity measure between two document vectors?
A. The sum of word counts
B. The difference in length
C. The number of common words
D. The angle between vectors
Cosine similarity measures the angle between two vectors to determine how similar their directions (meanings) are.
Why is TF-IDF useful in document similarity ranking?
A. It counts all words equally
B. It removes all punctuation
C. It highlights important words unique to documents
D. It translates documents to another language
TF-IDF scores words higher if they are important and unique to a document, improving similarity ranking.
Which method captures the meaning of words for better similarity ranking?
A. Document length counting
B. Word embeddings
C. Spell checking
D. Stop word removal
Word embeddings represent words in a way that captures their meaning and context.
In document similarity ranking, what is the main goal?
A. To order documents by how alike they are to a query
B. To count the number of pages in documents
C. To translate documents into images
D. To delete duplicate documents
The main goal is to rank documents by their similarity to a given query or document.
Explain how document vectors and cosine similarity work together to rank documents by similarity.
Think about how documents become points in space and how we measure their closeness.
Describe why TF-IDF is important for improving document similarity ranking compared to just counting words.
Consider how common words affect similarity and how TF-IDF adjusts for that.
Practice
1. What does document similarity ranking help us do in natural language processing?
easy
A. Find how related two texts are based on their content
B. Translate documents into different languages
C. Summarize long documents into short ones
D. Detect spelling errors in documents
Solution
Step 1: Understand the purpose of document similarity ranking
Document similarity ranking is used to compare texts and find how closely related they are based on their content.
Step 2: Identify the correct description
Among the options, only finding relatedness of texts matches the purpose of document similarity ranking.
Final Answer:
Find how related two texts are based on their content -> Option A
Quick Check:
Document similarity ranking = Find related texts [OK]
Hint: Think: similarity means how close or related texts are [OK]
Common Mistakes:
Confusing similarity ranking with translation
Thinking it summarizes documents
Mixing it up with spell checking
2. Which of the following is the correct way to compute cosine similarity between two vectors A and B in Python using NumPy?
easy
A. np.dot(A, B) * (np.linalg.norm(A) + np.linalg.norm(B))
B. np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
C. np.dot(A, B) - (np.linalg.norm(A) * np.linalg.norm(B))
D. np.dot(A, B) / (np.linalg.norm(A) + np.linalg.norm(B))
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = dot product of vectors divided by product of their magnitudes (norms).
Step 2: Match formula to code
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) correctly implements this formula using np.dot and np.linalg.norm.
Final Answer:
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option B
Quick Check:
Cosine similarity formula = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) [OK]
Hint: Cosine similarity = dot product ÷ (norm A × norm B) [OK]
Common Mistakes:
Adding norms instead of multiplying
Subtracting the product of norms instead of dividing by it
Multiplying dot product by sum of norms
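The formula can be wrapped in a small helper, shown here as a minimal sketch. The function name cosine_sim and the sample vectors are assumptions made for the example, and the helper does not guard against zero-length vectors.

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the two Euclidean norms.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([1.0, 2.0, 3.0])

same_direction = cosine_sim(A, 2 * A)       # scaling a vector does not change the score
orthogonal = cosine_sim([1, 0], [0, 1])     # no shared direction
print(same_direction, orthogonal)
```

Scaling A by 2 leaves the score at 1, which is why cosine similarity is insensitive to document length.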
3. Given the following Python code using TF-IDF and cosine similarity, what will be the printed similarity score between the two documents?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
docs = ['apple orange banana', 'banana fruit apple']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
sim_score = cosine_similarity(X[0], X[1])[0][0]
print(round(sim_score, 2))
medium
A. 0.50
B. 1.00
C. 0.58
D. 0.00
Solution
Step 1: Understand the TF-IDF vectors
Both documents share 'apple' and 'banana', while 'orange' and 'fruit' each appear in only one document. With TfidfVectorizer's default smoothed IDF, the shared words get weight ln(3/3) + 1 = 1 and the unique words get weight ln(3/2) + 1 ≈ 1.41.
Step 2: Calculate cosine similarity between vectors
Only the shared words contribute to the dot product: 1 + 1 = 2. Each vector has squared norm 1 + 1 + 1.41² ≈ 3.98, so the cosine similarity is 2 / 3.98 ≈ 0.50. The code prints 0.5.
Final Answer:
0.50 -> Option A
Quick Check:
Two shared words, one higher-weighted unique word per document ≈ 0.50 [OK]
Hint: Shared words raise the score; words unique to one document pull it below 1 [OK]
Common Mistakes:
Assuming similarity is exactly 1 for similar texts
Confusing cosine similarity with Euclidean distance
Ignoring TF-IDF weighting effects
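The score for this example can be checked by hand without scikit-learn. The sketch below assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1) and uses simple whitespace splitting in place of the library's tokenizer, which is equivalent for these two documents. L2 normalization is skipped because it does not change the cosine.

```python
import math

docs = ['apple orange banana', 'banana fruit apple']
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n = len(docs)

def idf(term):
    # Smoothed inverse document frequency (assumed to match the library default).
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_vector(doc):
    # Term count times IDF for every vocabulary term.
    return [doc.count(t) * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v0, v1 = (tfidf_vector(doc) for doc in tokenized)
score = cosine(v0, v1)
print(round(score, 2))  # prints 0.5
```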
4. The following code attempts to compute cosine similarity between two documents but raises an error. What is the main issue?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
docs = ['cat dog', 'dog mouse']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
sim_score = cosine_similarity(X[0], X[1])
print(sim_score)
medium
A. cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays
B. TfidfVectorizer cannot process documents with different words
C. cosine_similarity requires dense arrays, not sparse matrices
D. The print statement has a typo in variable name
Solution
Step 1: Check input types for cosine_similarity
cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays (shape (n_features,)).
Step 2: Understand how to fix the error
Pass 2D inputs instead: slice with X[0:1] and X[1:2], or reshape each row with X[0].reshape(1, -1) and X[1].reshape(1, -1).
Final Answer:
cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays -> Option A
Quick Check:
cosine_similarity input shape = 2D arrays [OK]
Hint: cosine_similarity needs 2D arrays, not single vectors [OK]
Common Mistakes:
Thinking TfidfVectorizer fails on different words
Thinking cosine_similarity accepts 1D arrays
Overlooking variable name typos
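A minimal sketch of the two fixes, using the same documents as the question. The variable names sim_slice and sim_reshape are chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['cat dog', 'dog mouse']
X = TfidfVectorizer().fit_transform(docs).toarray()

# X[0] is 1D with shape (n_features,), which cosine_similarity rejects.
row_shape = X[0].shape

# Fix 1: slicing keeps the row axis, giving shape (1, n_features).
sim_slice = cosine_similarity(X[0:1], X[1:2])[0][0]

# Fix 2: reshape each 1D row into a single-row 2D array.
sim_reshape = cosine_similarity(X[0].reshape(1, -1), X[1].reshape(1, -1))[0][0]

print(row_shape, round(sim_slice, 2), round(sim_reshape, 2))
```

Both fixes give the same score; calling cosine_similarity(X) on the whole matrix is another option when all pairwise scores are needed.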
5. You have a collection of 3 documents: ['apple banana', 'banana orange', 'apple orange banana']. You want to rank these documents by similarity to the query 'banana apple'. Which approach correctly ranks them from most to least similar using TF-IDF and cosine similarity?
hard
A. Use raw word counts without TF-IDF, rank by Euclidean distance ascending
B. Count word overlaps between query and documents, rank by overlap count ascending
C. Compute TF-IDF vectors but rank by cosine similarity scores ascending
D. Compute TF-IDF vectors for all documents and query, then rank by cosine similarity scores descending
Solution
Step 1: Understand ranking by similarity
To rank documents by similarity to a query, compute vector representations and measure similarity scores, then sort descending (highest similarity first).
Step 2: Identify correct method
TF-IDF vectors and cosine similarity are standard; ranking by descending cosine similarity scores is correct.
Final Answer:
Compute TF-IDF vectors for all documents and query, then rank by cosine similarity scores descending -> Option D
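The approach in Option D might look like the sketch below. Fitting the vectorizer on the documents and then transforming the query with the same vocabulary is one reasonable choice, not the only one; the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['apple banana', 'banana orange', 'apple orange banana']
query = 'banana apple'

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)    # one row per document
query_vector = vectorizer.transform([query])    # same vocabulary and IDF weights

# One similarity score per document, then sort indices from highest to lowest.
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = np.argsort(-scores)

for idx in ranking:
    print(f"{scores[idx]:.2f}  {docs[idx]}")
```

'apple banana' ranks first because it contains exactly the query's words, followed by 'apple orange banana' and then 'banana orange', which shares only one word with the query.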