When we want to find related text, we measure how close or similar two pieces of text are. The key metrics are Cosine Similarity and Jaccard Similarity. Cosine similarity measures the angle between two text vectors, showing how similar their meaning is regardless of length. Jaccard similarity compares shared words or features. These metrics help us find texts that talk about the same ideas or topics.
Why similarity measures find related text in NLP - Why Metrics Matter
Related Text Pairs (Positive) vs Not Related (Negative):
                      Predicted Related    Predicted Not Related
Actual Related        TP = 80              FN = 20
Actual Not Related    FP = 15              TN = 85
Total samples = 200
From this:
Precision = TP / (TP + FP) = 80 / (80 + 15) = 0.842
Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.8
F1 Score = 2 * (0.842 * 0.8) / (0.842 + 0.8) ≈ 0.82
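The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the counts from the confusion matrix; the function name precision_recall_f1 is an illustrative choice, not part of any library.

```python
# Counts taken from the confusion matrix above.
tp, fn, fp, tn = 80, 20, 15, 85

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted-related pairs that are truly related
    recall = tp / (tp + fn)      # share of truly related pairs that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.842 recall=0.800 f1=0.821
```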
These metrics show how well similarity measures help find truly related text pairs.
Sometimes we want to be very sure that the pairs we find are truly related (high precision). For example, in a legal document search, wrong matches waste time.
Other times, we want to find as many related texts as possible (high recall). For example, in research, missing related papers is bad.
Improving precision may lower recall and vice versa. Choosing the right balance depends on the task.
Good: Precision and recall both above 0.8 means most found pairs are truly related and most related pairs are found.
Bad: Precision below 0.5 means many unrelated pairs are marked related. Recall below 0.5 means many related pairs are missed.
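For similarity measures, precision and recall depend on the threshold used to decide relatedness: a pair counts as related only when its similarity score reaches the threshold, so choosing that threshold well is key.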
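The effect of the threshold can be seen by sweeping it over a small set of scored pairs. The scores and labels below are invented purely for illustration, and precision_recall_at is a hypothetical helper name; treat this as a sketch rather than an evaluation recipe.

```python
# Hypothetical (similarity score, is_related) pairs, invented for illustration.
scored_pairs = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.60, True), (0.55, False), (0.40, True), (0.30, False),
    (0.20, False), (0.10, False),
]

def precision_recall_at(threshold, pairs):
    """Treat pairs scoring >= threshold as predicted related."""
    tp = sum(1 for score, related in pairs if score >= threshold and related)
    fp = sum(1 for score, related in pairs if score >= threshold and not related)
    fn = sum(1 for score, related in pairs if score < threshold and related)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.35, 0.70, 0.88):
    p, r = precision_recall_at(threshold, scored_pairs)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# prints:
# threshold=0.35 precision=0.71 recall=1.00
# threshold=0.70 precision=0.75 recall=0.60
# threshold=0.88 precision=1.00 recall=0.40
```

On this toy data, raising the threshold trades recall for precision, which is the balance described above.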
- Accuracy paradox: If most text pairs are unrelated, a model that always says "not related" can have high accuracy but is useless.
- Data leakage: Using the same text in training and testing can inflate similarity scores and make evaluation results look better than they are.
- Overfitting: Similarity thresholds tuned too closely to one dataset may not work on new texts.
Your similarity model finds related text pairs with 98% accuracy but only 12% recall. Is it good for finding related texts? Why or why not?
Answer: No, because it misses most related pairs (low recall). It finds very few related texts even if it is usually correct when it does. For related text search, missing many related pairs is a big problem.
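To see how 98% accuracy and 12% recall can coexist, here is a small sketch. The counts are hypothetical, chosen only so that the two figures from the question come out; they are not from a real dataset.

```python
# Hypothetical counts chosen to match the question:
# 1100 pairs in total, of which only 25 are truly related.
tp, fn = 3, 22       # related pairs: found vs. missed
fp, tn = 0, 1075     # unrelated pairs: wrongly flagged vs. correctly rejected

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# A baseline that labels every pair "not related" gets all unrelated pairs right.
baseline_accuracy = (fp + tn) / total

print(f"accuracy={accuracy:.3f} recall={recall:.2f} baseline_accuracy={baseline_accuracy:.3f}")
# prints: accuracy=0.980 recall=0.12 baseline_accuracy=0.977
```

With so few related pairs, the model barely beats a baseline that never finds anything, which is why accuracy alone says little here.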
Practice
Solution
Step 1: Understand text representation in NLP
Texts are converted into numbers (vectors) so computers can compare them easily.
Step 2: Role of similarity measures
Similarity measures calculate how close these numeric vectors are, showing relatedness.
Final Answer:
Because they compare numeric representations of texts to find closeness -> Option A
Quick Check:
Similarity = Numeric comparison [OK]
- Thinking similarity compares raw words directly
- Confusing similarity with random selection
- Believing similarity translates text into images
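As a rough illustration of Step 1, texts can be turned into word-count vectors over a shared vocabulary. This is only one simple representation (bag of words); the sentences and the count_vectors helper are made up for the example.

```python
def count_vectors(text_a, text_b):
    """Build word-count vectors for two texts over their shared vocabulary."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    # Sorted so that both vectors use the same, stable dimension order.
    vocabulary = sorted(set(tokens_a) | set(tokens_b))
    vec_a = [tokens_a.count(term) for term in vocabulary]
    vec_b = [tokens_b.count(term) for term in vocabulary]
    return vocabulary, vec_a, vec_b

vocabulary, vec_a, vec_b = count_vectors("dogs chase cats", "cats chase mice")
print(vocabulary)  # ['cats', 'chase', 'dogs', 'mice']
print(vec_a)       # [1, 1, 1, 0]
print(vec_b)       # [1, 1, 0, 1]
```

Because both vectors share the same dimensions, a similarity measure can compare them position by position.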
A and B in Python?
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = dot product of vectors divided by product of their lengths.
Step 2: Match formula to code
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) matches this formula exactly using numpy functions.
Final Answer:
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option C
Quick Check:
Cosine similarity formula = cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) [OK]
- Adding vectors instead of dot product
- Multiplying dot product by sum of norms
- Using norm of difference instead of cosine similarity
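The formula from this solution can be wrapped in a small function. This sketch adds one assumption that the solution does not cover: if either vector has zero length, it returns 0.0 instead of dividing by zero. The example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # Assumed convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return float(np.dot(a, b) / norm_product)

# Scaling a vector changes its length but not its direction,
# so the similarity stays (numerically close to) 1.
print(cosine_similarity([1, 2, 3], [10, 20, 30]))

# Vectors with no shared non-zero dimension are orthogonal.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```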
text1 = {'apple', 'banana', 'cherry'} and text2 = {'banana', 'cherry', 'date'}, what is the Jaccard similarity between them?
Solution
Step 1: Calculate intersection and union of sets
Intersection = {'banana', 'cherry'} (2 items), Union = {'apple', 'banana', 'cherry', 'date'} (4 items).
Step 2: Compute Jaccard similarity
Jaccard similarity = size of intersection ÷ size of union = 2 ÷ 4 = 0.5.
Final Answer:
0.5 -> Option D
Quick Check:
Jaccard = intersection/union = 0.5 [OK]
- Counting union incorrectly
- Using sum instead of division
- Confusing intersection with union size
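The same calculation can be checked with Python sets. In this sketch, returning 0.0 for two empty sets is an assumed convention (some definitions use 1.0 instead).

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention for this sketch: two empty sets score 0.0.
        return 0.0
    return len(a & b) / len(union)

text1 = {'apple', 'banana', 'cherry'}
text2 = {'banana', 'cherry', 'date'}
print(jaccard_similarity(text1, text2))  # 0.5
```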
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5])
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)
Solution
Step 1: Check vector sizes
Vector A has length 3, vector B has length 2, so dot product is invalid.
Step 2: Understand dot product requirements
Dot product requires vectors of same length; mismatch causes error.
Final Answer:
Vectors A and B have different lengths causing dot product error -> Option B
Quick Check:
Dot product needs equal length vectors [OK]
- Assuming norm causes error
- Thinking division by zero happened
- Ignoring vector length mismatch
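The failure can be observed directly, and a shape check avoids it. This is a small sketch using the same A and B as the question; same_shape is a hypothetical helper name.

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5])

def same_shape(a, b):
    """Return True when two arrays can be passed to np.dot as plain vectors."""
    return a.ndim == 1 and a.shape == b.shape

# np.dot raises ValueError when 1-D inputs have different lengths.
try:
    np.dot(A, B)
    raised = False
except ValueError as exc:
    raised = True
    print("np.dot failed:", exc)

print(same_shape(A, B))  # False
```

In practice, a mismatch like this usually means the two texts were vectorized with different vocabularies, so the fix belongs in the vectorization step rather than in the similarity call.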
Solution
Step 1: Understand TF-IDF role
TF-IDF reduces the weight of common words, highlighting the terms that are distinctive for each article.
Step 2: Why cosine similarity on TF-IDF helps
Cosine similarity measures the angle between vectors, so articles of different lengths can still be compared fairly.
Final Answer:
Use cosine similarity on TF-IDF vectors to reduce common word impact -> Option A
Quick Check:
TF-IDF + cosine similarity = better relatedness [OK]
- Ignoring word importance by using raw counts
- Using Jaccard without preprocessing
- Relying on random scores
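The idea in this solution can be sketched without any external library. The code below uses a plain textbook weighting, tf × log(N / df), written by hand for illustration; library implementations often use smoothed variants, so their numbers will differ. The three example sentences are invented.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "the stock market fell on monday",
]

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

vecs = tfidf_vectors(docs)
sim_cats = cosine(vecs[0], vecs[1])    # both mention "cat"
sim_mixed = cosine(vecs[0], vecs[2])   # share only "the" and "on"
print(sim_cats > sim_mixed, sim_mixed)  # True 0.0
```

Because "the" and "on" appear in every document, their weight is log(3/3) = 0, so the first and third sentences score 0.0 even though they share two words.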
