Recall & Review
beginner
What is the main purpose of similarity measures in text analysis?
Similarity measures help find how close or related two pieces of text are by comparing their features, like words or meanings.
beginner
How do similarity measures represent text to compare them?
They convert text into numbers or vectors, such as word counts or embeddings, so they can be mathematically compared.
intermediate
Why does cosine similarity work well for finding related text?
Cosine similarity measures the angle between two text vectors, so it shows how closely their directions match regardless of their magnitude (for example, how long each text is). This makes it a good indicator of relatedness.
intermediate
What role does word meaning play in similarity measures like embeddings?
Embeddings capture word meanings in numbers, so similarity measures using embeddings find related text by comparing meanings, not just exact words.
advanced
Can similarity measures find related text even if words are different? How?
Yes, by using semantic representations like embeddings, similarity measures can find related text even if the exact words differ but the meanings are close.
What do similarity measures compare to find related text?
A. The font style of the text
B. The length of the text only
C. Numerical representations of text
D. The number of sentences
Similarity measures compare numerical forms like vectors to find how related texts are.
Which similarity measure uses the angle between vectors?
A. Euclidean distance
B. Cosine similarity
C. Jaccard index
D. Manhattan distance
Cosine similarity measures the angle between vectors to find similarity.
Why are embeddings useful for similarity in text?
A. They capture word meanings as numbers
B. They shorten the text length
C. They translate text to another language
D. They count word frequency
Embeddings represent word meanings numerically, helping find related text by meaning.
Can similarity measures find related text if words differ but meanings are similar?
A. Yes, with semantic representations
B. No, only exact words match
C. Only if texts have same length
D. Only if texts have same punctuation
Semantic representations like embeddings allow similarity measures to find related text beyond exact words.
What is a simple way to represent text for similarity comparison?
A. As video clips
B. As images
C. As audio files
D. As vectors of numbers
Text is converted into vectors of numbers to compare similarity.
Explain why similarity measures can find related text even if the exact words differ.
Think about how word meanings are captured beyond just the words themselves.
Describe how cosine similarity helps in finding related text.
Focus on what cosine similarity measures mathematically.
Practice
1. Why do similarity measures help find related text in NLP?
easy
A. Because they compare numeric representations of texts to find closeness
B. Because they translate text into images for comparison
C. Because they count the number of words in each text
D. Because they randomly select texts to compare
Solution
Step 1: Understand text representation in NLP
Texts are converted into numbers (vectors) so computers can compare them easily.
Step 2: Role of similarity measures
Similarity measures calculate how close these numeric vectors are, showing relatedness.
Final Answer:
Because they compare numeric representations of texts to find closeness -> Option A
Quick Check:
Similarity = Numeric comparison [OK]
Hint: Similarity means comparing numbers, not words directly [OK]
Common Mistakes:
Thinking similarity compares raw words directly
Confusing similarity with random selection
Believing similarity translates text into images
2. Which of the following is the correct way to calculate cosine similarity between two vectors A and B in Python?
easy
A. cos_sim = np.linalg.norm(A - B)
B. cos_sim = np.sum(A + B)
C. cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
D. cos_sim = np.dot(A, B) * (np.linalg.norm(A) + np.linalg.norm(B))
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = dot product of the vectors divided by the product of their norms (magnitudes).
Step 2: Match formula to code
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) matches this formula exactly using numpy functions.
Final Answer:
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option C
Quick Check:
Cosine similarity = dot product ÷ product of norms [OK]
Common Mistakes:
Using norm of difference instead of cosine similarity
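The formula in Option C can be tried on a few small vectors. The sketch below is illustrative only: the helper name cosine_similarity and the vectors A, B, and C are assumptions made for this example, not part of the question. It shows that scaling a vector does not change the cosine similarity, while the norm of the difference (Option A) measures something else.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (assumes neither is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors, not derived from real text
A = np.array([1, 2, 3])
B = np.array([2, 4, 6])    # same direction as A, twice the magnitude
C = np.array([3, 0, -1])   # orthogonal to A: 1*3 + 2*0 + 3*(-1) = 0

sim_ab = cosine_similarity(A, B)         # approximately 1.0 (same direction)
sim_ac = cosine_similarity(A, C)         # 0.0 (orthogonal)
dist_ab = float(np.linalg.norm(A - B))   # Euclidean distance, not a cosine

print(sim_ab, sim_ac, dist_ab)
```

A and B point in the same direction, so their cosine similarity is about 1 even though their Euclidean distance is not zero.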
3. Given two texts converted to sets of words: text1 = {'apple', 'banana', 'cherry'} and text2 = {'banana', 'cherry', 'date'}, what is the Jaccard similarity between them?
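The intersection is {'banana', 'cherry'} (2 words) and the union is {'apple', 'banana', 'cherry', 'date'} (4 words), so Jaccard similarity = size of intersection ÷ size of union = 2 ÷ 4 = 0.5.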
Final Answer:
0.5 -> Option D
Quick Check:
Jaccard = intersection/union = 0.5 [OK]
Hint: Jaccard = common words ÷ total unique words [OK]
Common Mistakes:
Counting union incorrectly
Using sum instead of division
Confusing intersection with union size
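The same calculation can be checked with Python sets. This is a minimal sketch: the function name jaccard_similarity and the choice to return 0.0 for two empty sets are assumptions made for this example.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two word sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

text1 = {'apple', 'banana', 'cherry'}
text2 = {'banana', 'cherry', 'date'}

score = jaccard_similarity(text1, text2)
print(score)  # 0.5
```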
4. The following Python code tries to compute cosine similarity but gives an error. What is the main issue?
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5])
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)
medium
A. np.linalg.norm is used incorrectly
B. Vectors A and B have different lengths causing dot product error
C. Division by zero error
D. Missing import statement for numpy
Solution
Step 1: Check vector sizes
Vector A has 3 elements and vector B has 2, so np.dot(A, B) raises a ValueError.
Step 2: Understand dot product requirements
Dot product requires vectors of same length; mismatch causes error.
Final Answer:
Vectors A and B have different lengths causing dot product error -> Option B
Quick Check:
Dot product needs equal length vectors [OK]
Hint: Dot product needs vectors of same length [OK]
Common Mistakes:
Assuming norm causes error
Thinking division by zero happened
Ignoring vector length mismatch
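One way to guard against this error is to check the shapes before computing the dot product. The sketch below is one possible approach, not the only fix: the helper name safe_cosine_similarity, the replacement vector B_fixed, and the choice to return 0.0 for a zero vector are all assumptions made for illustration.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Cosine similarity that fails early with a clear message on a shape mismatch."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # assumed convention when either vector is all zeros
    return float(np.dot(a, b) / denom)

A = np.array([1, 2, 3])
B = np.array([4, 5])           # 2 elements: cannot be compared with A

try:
    safe_cosine_similarity(A, B)
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(err)

B_fixed = np.array([4, 5, 6])  # hypothetical corrected vector with 3 elements
sim = safe_cosine_similarity(A, B_fixed)
print(sim)
```

In practice, both texts should be vectorized against the same vocabulary or by the same embedding model, so that their vectors have the same number of dimensions.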
5. You want to find related news articles using similarity measures. Which approach best improves accuracy when articles have different lengths and some common words?
hard
A. Use cosine similarity on TF-IDF vectors to reduce common word impact
B. Use raw word counts and Jaccard similarity without preprocessing
C. Compare articles by counting total words only
D. Use random similarity scores to guess relatedness
Solution
Step 1: Understand TF-IDF role
TF-IDF lowers the weight of words that appear in many articles, so the terms that distinguish each article stand out.
Step 2: Why cosine similarity on TF-IDF helps
Cosine similarity measures the angle between vectors and ignores their magnitude, so long and short articles can be compared fairly.
Final Answer:
Use cosine similarity on TF-IDF vectors to reduce common word impact -> Option A
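The approach in Option A can be sketched in plain Python. This is a simplified illustration under stated assumptions: it uses whitespace tokenization and the basic unsmoothed weighting tf × log(N / df), whereas library implementations typically use smoothed variants. The function names and the three sample sentences are invented for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one TF-IDF vector per document, plus the shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(w for toks in tokenized for w in set(toks))
    # Unsmoothed IDF: a word found in every document gets weight 0
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero for an empty document
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vectors, vocab

def cosine(u, v):
    """Cosine similarity for two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Invented sample "articles" of different lengths
articles = [
    "the team won the football match",
    "the football team lost the match in the final minute of the game",
    "the bank raised the interest rate",
]

vecs, _ = tfidf_vectors(articles)
sports_pair = cosine(vecs[0], vecs[1])  # share "team", "football", "match"
mixed_pair = cosine(vecs[0], vecs[2])   # share only "the", which has weight 0

print(sports_pair, mixed_pair)
```

The two sports sentences score higher than the sports and finance pair, even though "the" appears many times in all three, because its TF-IDF weight is zero here.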