Practice


1. Why do similarity measures help find related text in NLP?

easy

A. Because they compare numeric representations of texts to find closeness

B. Because they translate text into images for comparison

C. Because they count the number of words in each text

D. Because they randomly select texts to compare

Solution

Step 1: Understand text representation in NLP
Texts are converted into numbers (vectors) so computers can compare them easily.
Step 2: Role of similarity measures
Similarity measures score how close two of these vectors are; a higher score suggests the texts are more closely related.
Final Answer:
Because they compare numeric representations of texts to find closeness -> Option A
Quick Check:
Similarity = Numeric comparison [OK]

Hint: Similarity means comparing numbers, not words directly

Common Mistakes:

Thinking similarity compares raw words directly
Confusing similarity with random selection
Believing similarity translates text into images
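
The two steps in this solution can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple bag-of-words representation (one count per vocabulary word); the helper names to_count_vector and cosine and the three sample texts are hypothetical and not part of the quiz.

```python
from collections import Counter
import math

def to_count_vector(text, vocabulary):
    """Map a text to a list of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative texts (assumed for this sketch)
texts = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]

# Step 1: turn every text into a numeric vector over a shared vocabulary
vocabulary = sorted({word for t in texts for word in t.lower().split()})
vectors = [to_count_vector(t, vocabulary) for t in texts]

# Step 2: compare the vectors; overlapping wording gives a higher score
related = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
print(related > unrelated)  # True
```

The first two texts share four words, so their vectors point in similar directions; the third shares none, so its score against the first is zero.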

2. Which of the following is the correct way to calculate cosine similarity between two vectors A and B in Python?

easy

A. cos_sim = np.linalg.norm(A - B)

B. cos_sim = np.sum(A + B)

C. cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

D. cos_sim = np.dot(A, B) * (np.linalg.norm(A) + np.linalg.norm(B))

Solution

Step 1: Recall cosine similarity formula
Cosine similarity = dot product of the vectors divided by the product of their magnitudes (Euclidean norms).
Step 2: Match formula to code
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) matches this formula exactly using numpy functions.
Final Answer:
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option C
Quick Check:
Cosine similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) [OK]

Hint: Cosine similarity = dot product ÷ product of norms

Common Mistakes:

Adding the vectors instead of taking their dot product
Multiplying the dot product by the sum of the norms
Using the norm of the difference (Euclidean distance) instead of cosine similarity
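
As a quick sanity check on the formula in Option C, the sketch below wraps it in a function and tries it on vectors whose answers are known in advance. The function name cosine_similarity and the vectors B and C are assumptions chosen for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([1, 2, 3])
B = np.array([2, 4, 6])    # same direction as A, twice the magnitude
C = np.array([3, 0, -1])   # orthogonal to A: 1*3 + 2*0 + 3*(-1) = 0

same = cosine_similarity(A, B)    # approximately 1
ortho = cosine_similarity(A, C)   # approximately 0

# Option A computes this instead: a distance, which grows as vectors differ
distance = float(np.linalg.norm(A - B))

print(same, ortho, distance)
```

Scaling a vector does not change its cosine similarity to another vector, which is why A and B score as identical in direction even though their Euclidean distance is not zero.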

3. Given two texts converted to sets of words: text1 = {'apple', 'banana', 'cherry'} and text2 = {'banana', 'cherry', 'date'}, what is the Jaccard similarity between them?

medium

A. 0.25

B. 0.6

C. 0.75

D. 0.5

Solution

Step 1: Calculate intersection and union of sets
Intersection = {'banana', 'cherry'} (2 items), Union = {'apple', 'banana', 'cherry', 'date'} (4 items).
Step 2: Compute Jaccard similarity
Jaccard similarity = size of intersection ÷ size of union = 2 ÷ 4 = 0.5.
Final Answer:
0.5 -> Option D
Quick Check:
Jaccard = intersection/union = 0.5 [OK]

Hint: Jaccard = common words ÷ total unique words

Common Mistakes:

Counting the union incorrectly (for example, counting shared words twice)
Adding the intersection and union sizes instead of dividing them
Confusing the intersection size with the union size
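
The calculation above maps directly onto Python's built-in set operators. The sketch below is one way to write it; the function name jaccard_similarity is hypothetical, and returning 0.0 for two empty sets is a convention assumed here, not something the quiz specifies.

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(set_a & set_b) / len(union)

text1 = {'apple', 'banana', 'cherry'}
text2 = {'banana', 'cherry', 'date'}

score = jaccard_similarity(text1, text2)
print(score)  # 0.5
```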

4. The following Python code tries to compute cosine similarity but gives an error. What is the main issue?

import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5])
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)

medium

A. np.linalg.norm is used incorrectly

B. Vectors A and B have different lengths causing dot product error

C. Division by zero error

D. Missing import statement for numpy

Solution

Step 1: Check vector sizes
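Vector A has 3 elements and vector B has 2, so np.dot(A, B) raises a ValueError (shapes not aligned).
Step 2: Understand dot product requirements
The dot product multiplies the vectors element by element, so both must have the same number of elements; a mismatch raises an error before the division is ever reached.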
Final Answer:
Vectors A and B have different lengths causing dot product error -> Option B
Quick Check:
Dot product needs equal length vectors [OK]

Hint: Dot product needs vectors of same length

Common Mistakes:

Assuming norm causes error
Thinking division by zero happened
Ignoring vector length mismatch
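
One way to make this failure easier to diagnose is to check the shapes before calling np.dot. The sketch below is an illustration only; the function name safe_cosine_similarity and its error messages are hypothetical.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Cosine similarity with explicit checks for the common failure cases."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        # Same problem as in the question: 3 elements vs 2 elements
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

A = np.array([1, 2, 3])
B = np.array([4, 5])

try:
    safe_cosine_similarity(A, B)
    message = None
except ValueError as err:
    message = str(err)

print(message)  # shape mismatch: (3,) vs (2,)
```

In practice the fix is to build both vectors over the same vocabulary or embedding dimension, so that their shapes always match.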

5. You want to find related news articles using similarity measures. Which approach best improves accuracy when the articles differ in length and share many frequently used words?

hard

A. Use cosine similarity on TF-IDF vectors to reduce common word impact

B. Use raw word counts and Jaccard similarity without preprocessing

C. Compare articles by counting total words only

D. Use random similarity scores to guess relatedness

Solution

Step 1: Understand TF-IDF role
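TF-IDF lowers the weight of words that appear in many articles and raises the weight of terms that are distinctive to a few.
Step 2: Why cosine similarity on TF-IDF helps
Cosine similarity measures the angle between vectors and ignores their magnitude, so a long article and a short one on the same topic can still score as similar.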
Final Answer:
Use cosine similarity on TF-IDF vectors to reduce common word impact -> Option A
Quick Check:
TF-IDF + cosine similarity = better relatedness [OK]

Hint: TF-IDF + cosine similarity handles length and common words best

Common Mistakes:

Ignoring word importance by using raw counts
Using Jaccard without preprocessing
Relying on random scores
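
To see the effect of TF-IDF weighting, the sketch below builds the vectors by hand and compares cosine scores with and without the IDF factor. It is a simplified illustration under stated assumptions: whitespace tokenization, raw term counts, and idf = log(N / df) + 1, which is one of several variants in use. The function names and the three sample articles are hypothetical.

```python
import math
from collections import Counter

def weighted_vectors(documents, use_idf=True):
    """Return one vector per document: term count times IDF (or counts only)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    if use_idf:
        idf = {w: math.log(n_docs / doc_freq[w]) + 1.0 for w in vocabulary}
    else:
        idf = {w: 1.0 for w in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] * idf[w] for w in vocabulary])
    return vectors

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

articles = [
    "the central bank raised interest rates",
    "the central bank said the higher interest rates will slow the economy this year",
    "the team won the match in the final minute",
]

tfidf = weighted_vectors(articles, use_idf=True)
raw = weighted_vectors(articles, use_idf=False)

related_tfidf = cosine(tfidf[0], tfidf[1])     # short vs long, same topic
unrelated_tfidf = cosine(tfidf[0], tfidf[2])   # share only the word "the"
unrelated_raw = cosine(raw[0], raw[2])

print(related_tfidf > unrelated_tfidf)   # True
print(unrelated_tfidf < unrelated_raw)   # True
```

The first and third articles overlap only on "the". With raw counts that overlap still produces a noticeable score; with the IDF factor the distinctive words carry more weight, so the unrelated pair scores lower.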

Why similarity measures find related text in NLP - The Real Reasons
