
Document similarity ranking in NLP - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to compute cosine similarity between two vectors.

from sklearn.metrics.pairwise import [1]

vec1 = [[1, 2, 3]]
vec2 = [[4, 5, 6]]
similarity = [1](vec1, vec2)
print(similarity)
A. cosine_similarity
B. euclidean_distance
C. manhattan_distance
D. dot_product
Common Mistakes
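Choosing a distance function such as euclidean_distance, which would measure dissimilarity rather than similarity (the actual scikit-learn function is named euclidean_distances).
Trying to import dot_product, which does not exist in sklearn.metrics.pairwise.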
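To make the similarity-versus-distance mistake concrete, here is a minimal sketch. The vectors `a` and `b` are illustrative values chosen for this example, not part of the task. Because `b` points in the same direction as `a`, the cosine similarity is at its maximum even though the Euclidean distance between them is well above zero.

```python
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Illustrative vectors (an assumption for this sketch): b is a scaled copy of a.
a = [[1, 2, 3]]
b = [[2, 4, 6]]

# Both functions take 2D inputs and return a 2D array of pairwise values.
sim = cosine_similarity(a, b)[0, 0]     # compares direction only
dist = euclidean_distances(a, b)[0, 0]  # grows with the gap between the points

print("cosine similarity:", sim)
print("euclidean distance:", dist)
```

A high similarity and a large distance can therefore coexist, which is why the two kinds of function are not interchangeable in a ranking.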
2. Fill in the blank
medium

Complete the code to convert text documents into TF-IDF vectors.

from sklearn.feature_extraction.text import [1]

corpus = ['I love machine learning', 'Machine learning is fun']
tfidf_vectorizer = [1]()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
A. HashingVectorizer
B. TfidfVectorizer
C. CountVectorizer
D. DictVectorizer
Common Mistakes
Using CountVectorizer, which only counts word occurrences and applies no IDF weighting.
Using HashingVectorizer, which hashes tokens to feature indices and does not compute TF-IDF.
3. Fill in the blank
hard

Complete the code so that it computes pairwise similarity scores (not distances) between the documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import [1]

texts = ['Data science is cool', 'I love data science']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
scores = [1](matrix)
print(scores)
A. manhattan_distances
B. euclidean_distances
C. pairwise_distances
D. cosine_similarity
Common Mistakes
Using distance functions which give dissimilarity scores.
Passing the wrong matrix shape to the function.
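When a pairwise function is called with a single matrix, each row is compared against every row, including itself. The sketch below, which reuses the task's two texts, shows the shape of the result and how it relates to scikit-learn's `cosine_distances`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

texts = ['Data science is cool', 'I love data science']
matrix = TfidfVectorizer().fit_transform(texts)

sim = cosine_similarity(matrix)   # shape (n_docs, n_docs)
dist = cosine_distances(matrix)   # defined as 1 - cosine similarity

print(sim.shape)
print(sim.round(3))    # symmetric; each document matches itself with score 1
print(dist.round(3))   # the same information expressed as dissimilarity
```

Sorting by a distance in descending order would put the least similar documents first, which is the usual symptom of picking a distance function by accident.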
4. Fill in the blank
hard

Fill both blanks to create a dictionary of document similarity scores above a threshold.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['AI is the future', 'AI and ML are related', 'I enjoy sports']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
sim_matrix = cosine_similarity(matrix)

threshold = 0.5
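# Exclude j == i in both places: a document always matches itself with score 1.0,
# so without that check every document would pass the any() filter.
similar_docs = {
    i: [j for j in range(len(texts)) if j != i and sim_matrix[i][j] [1] threshold]
    for i in range(len(texts))
    if any(sim_matrix[i][j] [2] threshold for j in range(len(texts)) if j != i)
}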
print(similar_docs)
A. >
B. >=
C. <
D. <=
Common Mistakes
Using '<' or '<=' which would select less similar documents.
Mixing different operators in the two blanks.
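The threshold filter is easier to check when written as a small function and run on a matrix whose values are known in advance. In this sketch, `neighbors_above` and the `toy` matrix are hypothetical names and values introduced only for illustration.

```python
import numpy as np

def neighbors_above(sim_matrix, threshold):
    """Map each document index to the other documents scoring above threshold."""
    sim = np.asarray(sim_matrix, dtype=float)
    result = {}
    for i in range(sim.shape[0]):
        # Skip j == i: self-similarity is always 1.0 and carries no information.
        matches = [j for j in range(sim.shape[1]) if j != i and sim[i, j] > threshold]
        if matches:  # leave out documents with no sufficiently similar neighbour
            result[i] = matches
    return result

# Hand-written similarity matrix (assumed values, not computed from real text).
toy = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
print(neighbors_above(toy, 0.5))
```

Only documents 0 and 1 clear the 0.5 threshold with each other, so document 2 does not appear as a key at all.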
5. Fill in the blank
hard

Fill all three blanks to rank documents by similarity to a query document.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['Deep learning is powerful', 'I like deep learning', 'Cats are cute']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

query = ['I love learning']
query_vec = vectorizer.transform(query)
sim_scores = cosine_similarity(query_vec, matrix)[0]

ranked_docs = sorted(((i, sim_scores[i]) for i in range(len(corpus))), key=lambda x: x[1] x[2], reverse=[3])
print(ranked_docs)
A. *
B. -
C. >
D. True
Common Mistakes
Using '*' or '+' in the key which does not sort properly.
Setting reverse=False which sorts ascending.
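One way to write the ranking step, offered as a sketch rather than as the exercise's intended fill-in, is to sort (index, score) pairs by the score with `reverse=True`. It uses the task's corpus and query and assumes default TfidfVectorizer settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['Deep learning is powerful', 'I like deep learning', 'Cats are cute']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# transform (not fit_transform) reuses the corpus vocabulary; query words
# that never occurred in the corpus, such as "love", are ignored.
query_vec = vectorizer.transform(['I love learning'])
sim_scores = cosine_similarity(query_vec, matrix)[0]

# Sort (index, score) pairs by score, highest first.
ranked_docs = sorted(enumerate(sim_scores), key=lambda pair: pair[1], reverse=True)
for index, score in ranked_docs:
    print(index, round(float(score), 3), corpus[index])
```

The third document shares no vocabulary with the query, so its score is exactly zero and it sorts last.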

Practice

1. What does document similarity ranking help us do in natural language processing?
easy
A. Find how related two texts are based on their content
B. Translate documents into different languages
C. Summarize long documents into short ones
D. Detect spelling errors in documents

Solution

  1. Step 1: Understand the purpose of document similarity ranking

    Document similarity ranking is used to compare texts and find how closely related they are based on their content.
  2. Step 2: Identify the correct description

    Among the options, only finding relatedness of texts matches the purpose of document similarity ranking.
  3. Final Answer:

    Find how related two texts are based on their content -> Option A
  4. Quick Check:

    Document similarity ranking = Find related texts [OK]
Hint: Think: similarity means how close or related texts are [OK]
Common Mistakes:
  • Confusing similarity ranking with translation
  • Thinking it summarizes documents
  • Mixing it up with spell checking
2. Which of the following is the correct way to compute cosine similarity between two vectors A and B in Python using NumPy?
easy
A. np.dot(A, B) * (np.linalg.norm(A) + np.linalg.norm(B))
B. np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
C. np.dot(A, B) - (np.linalg.norm(A) * np.linalg.norm(B))
D. np.dot(A, B) / (np.linalg.norm(A) + np.linalg.norm(B))

Solution

  1. Step 1: Recall cosine similarity formula

    Cosine similarity = dot product of vectors divided by product of their magnitudes (norms).
  2. Step 2: Match formula to code

    np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) correctly implements this formula using np.dot and np.linalg.norm.
  3. Final Answer:

    np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option B
  4. Quick Check:

    Cosine similarity formula = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) [OK]
Hint: Cosine similarity = dot product ÷ (norm A × norm B) [OK]
Common Mistakes:
  • Adding norms instead of multiplying
  • Subtracting the product of the norms instead of dividing by it
  • Multiplying dot product by sum of norms
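The formula can be wrapped in a small helper. This is a minimal sketch: the name `cosine_sim` and the choice to return 0.0 for a zero-length vector are assumptions made for this example, since the ratio is undefined in that case.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1D vectors: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # assumption: a zero vector is treated as having no similarity
    return float(np.dot(a, b) / denom)

print(cosine_sim([1, 2, 3], [4, 5, 6]))  # nearly parallel vectors
print(cosine_sim([1, 0], [0, 1]))        # orthogonal vectors
```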
3. Given the following Python code using TF-IDF and cosine similarity, what will be the printed similarity score between the two documents?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['apple orange banana', 'banana fruit apple']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
sim_score = cosine_similarity(X[0], X[1])[0][0]
print(round(sim_score, 2))
medium
A. 0.50
B. 1.00
C. 0.58
D. 0.00

Solution

  1. Step 1: Work out the TF-IDF weights

    Both documents contain 'apple' and 'banana'; 'orange' and 'fruit' each appear in only one document. With TfidfVectorizer's default smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1 with n = 2 documents, so the shared words get weight 1 and each unique word gets ln(3/2) + 1 ≈ 1.405 (before normalization).
  2. Step 2: Calculate cosine similarity between vectors

    The dot product is 1·1 + 1·1 = 2, and each vector's squared norm is 1 + 1 + 1.405² ≈ 3.975. Cosine similarity ≈ 2 / 3.975 ≈ 0.503, which rounds to 0.50.
  3. Final Answer:

    0.50 -> Option A
  4. Quick Check:

    Two shared words plus one higher-weighted unique word per document ≈ 0.50 [OK]
Hint: Shared words raise cosine similarity; words unique to one document lower it [OK]
Common Mistakes:
  • Assuming similarity is exactly 1 for similar texts
  • Confusing cosine similarity with Euclidean distance
  • Ignoring TF-IDF weighting effects (raw counts would give 2/3 ≈ 0.67)
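The score can be checked by hand. The sketch below assumes TfidfVectorizer's default settings (smoothed IDF, L2 normalization) and prints the library value next to a value computed directly from the IDF formula idf = ln((1 + n) / (1 + df)) + 1.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['apple orange banana', 'banana fruit apple']
X = TfidfVectorizer().fit_transform(docs)
sim_score = float(cosine_similarity(X[0], X[1])[0, 0])

# By hand: n = 2 documents. Shared words (df = 2) get idf = ln(3/3) + 1 = 1.
# Words found in one document (df = 1) get idf = ln(3/2) + 1.
idf_unique = math.log((1 + 2) / (1 + 1)) + 1
dot = 1 * 1 + 1 * 1                  # 'apple' and 'banana' overlap
squared_norm = 1 + 1 + idf_unique ** 2
by_hand = dot / squared_norm         # both vectors have the same norm

print(round(sim_score, 2), round(by_hand, 2))
```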
4. The following code attempts to compute cosine similarity between two documents but raises an error. What is the main issue?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['cat dog', 'dog mouse']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
sim_score = cosine_similarity(X[0], X[1])
print(sim_score)
medium
A. cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays
B. TfidfVectorizer cannot process documents with different words
C. cosine_similarity requires dense arrays, not sparse matrices
D. The print statement has a typo in variable name

Solution

  1. Step 1: Check input types for cosine_similarity

    cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays (shape (n_features,)).
  2. Step 2: Understand how to fix the error

    Slice with X[0:1] and X[1:2], or call X[0].reshape(1, -1) and X[1].reshape(1, -1), so that each input has shape (1, n_features).
  3. Final Answer:

    cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays -> Option A
  4. Quick Check:

    cosine_similarity input shape = 2D arrays [OK]
Hint: cosine_similarity needs 2D arrays, not single vectors [OK]
Common Mistakes:
  • Thinking TfidfVectorizer fails on different words
  • Thinking cosine_similarity accepts 1D arrays
  • Looking for a variable-name typo when the names are in fact consistent
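A corrected version of the snippet might look like the following sketch, which shows two equivalent ways of keeping each row two-dimensional.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['cat dog', 'dog mouse']
X = TfidfVectorizer().fit_transform(docs).toarray()

print(X[0].shape)            # 1D: indexing a dense array with an int drops a dimension

row0 = X[0].reshape(1, -1)   # option 1: reshape to (1, n_features)
row1 = X[1:2]                # option 2: slicing keeps the row 2D

sim_score = cosine_similarity(row0, row1)
print(sim_score)             # a 1x1 array holding the single pairwise score
```

Skipping `.toarray()` also avoids the problem, because indexing a sparse matrix with `X[0]` returns a 1-row matrix rather than a 1D vector.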
5. You have a collection of 3 documents: ['apple banana', 'banana orange', 'apple orange banana']. You want to rank these documents by similarity to the query 'banana apple'. Which approach correctly ranks them from most to least similar using TF-IDF and cosine similarity?
hard
A. Use raw word counts without TF-IDF, rank by Euclidean distance ascending
B. Count word overlaps between query and documents, rank by overlap count ascending
C. Compute TF-IDF vectors but rank by cosine similarity scores ascending
D. Compute TF-IDF vectors for all documents and query, then rank by cosine similarity scores descending

Solution

  1. Step 1: Understand ranking by similarity

    To rank documents by similarity to a query, compute vector representations and measure similarity scores, then sort descending (highest similarity first).
  2. Step 2: Identify correct method

    TF-IDF vectors and cosine similarity are standard; ranking by descending cosine similarity scores is correct.
  3. Final Answer:

    Compute TF-IDF vectors for all documents and query, then rank by cosine similarity scores descending -> Option D
  4. Quick Check:

    Similarity ranking = cosine similarity descending [OK]
Hint: Rank documents by highest cosine similarity to query [OK]
Common Mistakes:
  • Ranking by ascending similarity (lowest first)
  • Using raw counts without weighting
  • Ranking by overlap count ascending
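The approach can be run end to end on the three documents from the question. This sketch fits the vectorizer on the documents only and then transforms the query with the same vocabulary, which is one reasonable reading of "compute TF-IDF vectors for all documents and query"; default TfidfVectorizer settings are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ['apple banana', 'banana orange', 'apple orange banana']
query = 'banana apple'

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([query])   # same vocabulary and IDF weights

scores = cosine_similarity(query_vec, doc_matrix)[0]

# argsort is ascending, so negate the scores to get the highest first.
order = [int(i) for i in np.argsort(-scores)]
for i in order:
    print(i, round(float(scores[i]), 3), documents[i])
```

Word order does not matter in a bag-of-words model, so 'apple banana' matches the query exactly; the three-word document comes next, and 'banana orange', which shares only one word with the query, comes last.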