Similarity measures help us find text pieces that talk about the same or similar things. They make it easy to group or compare texts without reading everything.
Why similarity measures find related text in NLP
similarity_score = similarity_measure(text1_vector, text2_vector)
Text must be converted into numbers (vectors) before measuring similarity.
Common measures include cosine similarity and Jaccard similarity, where a higher score means more alike, and Euclidean distance, where a lower value means more alike.
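As a minimal sketch of the two steps above (vectorize, then measure), the example below builds simple word-count vectors and compares them with Euclidean distance. The helper name `to_count_vectors`, the whitespace tokenizer, and the raw-count weighting are illustrative assumptions, not what any particular library does internally.

```python
import math
from collections import Counter

def to_count_vectors(texts):
    """Turn each text into a list of word counts over a shared vocabulary."""
    # Lowercase and split on whitespace (a deliberately simple tokenizer).
    tokenized = [text.lower().split() for text in texts]
    # Sort the vocabulary so every vector uses the same column order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = to_count_vectors(["cat dog", "dog mouse"])
print(vocabulary)  # ['cat', 'dog', 'mouse']
print(vectors)     # [[1, 1, 0], [0, 1, 1]]

# Euclidean distance is a distance, so lower means more alike.
distance = math.dist(vectors[0], vectors[1])
print(distance)    # sqrt(2), roughly 1.414
```

Comparing a vector with itself gives a distance of 0, which is the "most similar" end of the scale for a distance measure.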
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['I love apples', 'I like apples', 'I hate bananas']
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
score = cosine_similarity(vectors[0], vectors[1])
print(score[0][0])
text1 = 'cat dog'
text2 = 'dog mouse'
set1 = set(text1.split())
set2 = set(text2.split())
jaccard = len(set1 & set2) / len(set1 | set2)
print(jaccard)
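The following program compares one sentence with a related one and an unrelated one. The related pair should score higher, but with the default settings the gap is small: the first and third sentences still share the common word 'is', which counts as overlap.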
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    'Machine learning is fun',
    'I enjoy learning about machines',
    'The sky is blue today'
]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

# Calculate similarity between first and second text
score_0_1 = cosine_similarity(vectors[0], vectors[1])[0][0]
# Calculate similarity between first and third text
score_0_2 = cosine_similarity(vectors[0], vectors[2])[0][0]

print(f'Similarity between text 0 and 1: {score_0_1:.2f}')
print(f'Similarity between text 0 and 2: {score_0_2:.2f}')
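For non-negative vectors such as word counts or TF-IDF weights, cosine similarity ranges from 0 (no shared terms) to 1 (same direction, which includes identical texts). Jaccard similarity also ranges from 0 to 1. Euclidean distance has no upper bound, and 0 means the vectors are identical.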
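To see where these endpoints come from, here is a small sketch of cosine similarity written out by hand. The function name `cosine` and the toy vectors are illustrative assumptions. The third case uses negative components, which word-count and TF-IDF vectors never have; that is why text scores stay at or above 0.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1, 2], [2, 4])   # parallel vectors
no_overlap = cosine([1, 0], [0, 1])       # no shared dimensions
opposite = cosine([1, 2], [-1, -2])       # needs negative components

print(round(same_direction, 6), no_overlap, round(opposite, 6))
```

The first pair is not identical, yet it scores about 1 because cosine similarity looks only at direction, not magnitude.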
Choosing the right similarity measure depends on your text and task.
Preprocessing text (like lowercasing, removing stopwords) can improve similarity results.
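A rough sketch of that effect, using Jaccard similarity on word sets. The stop-word list is a tiny hand-picked assumption for illustration only, and `preprocess` is a hypothetical helper; real projects typically rely on a fuller list from a library. Here two unrelated sentences look partly similar until the filler words are removed.

```python
def jaccard(words_a, words_b):
    """Shared words divided by all distinct words across both sets."""
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical, deliberately tiny stop-word list for this example only.
STOP_WORDS = {"the", "is", "a", "on"}

def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

text1 = "The cat is on a mat"
text2 = "the Dog is on a rug"

# Without preprocessing, 'is', 'on' and 'a' count as shared content.
raw_score = jaccard(set(text1.split()), set(text2.split()))
# After preprocessing, only the content words remain.
clean_score = jaccard(preprocess(text1), preprocess(text2))

print(f"raw: {raw_score:.2f}, preprocessed: {clean_score:.2f}")
```

The raw score comes from three shared filler words out of nine distinct tokens; after preprocessing, the two sentences share nothing.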
Similarity measures help find related texts by comparing their numeric forms.
They are useful in many real-life tasks like recommendations and grouping.
Cosine similarity and Jaccard similarity are common and easy to use.
Practice
Solution
Step 1: Understand text representation in NLP
Texts are converted into numbers (vectors) so computers can compare them easily.
Step 2: Role of similarity measures
Similarity measures calculate how close these numeric vectors are, showing relatedness.
Final Answer:
Because they compare numeric representations of texts to find closeness -> Option A
Quick Check:
Similarity = Numeric comparison [OK]
- Thinking similarity compares raw words directly
- Confusing similarity with random selection
- Believing similarity translates text into images
A and B in Python?
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = the dot product of the vectors divided by the product of their lengths (norms).
Step 2: Match formula to code
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) matches this formula exactly using numpy functions.
Final Answer:
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option C
Quick Check:
Cosine similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) [OK]
- Adding vectors instead of dot product
- Multiplying dot product by sum of norms
- Using norm of difference instead of cosine similarity
text1 = {'apple', 'banana', 'cherry'} and text2 = {'banana', 'cherry', 'date'}, what is the Jaccard similarity between them?
Solution
Step 1: Calculate intersection and union of sets
Intersection = {'banana', 'cherry'} (2 items), Union = {'apple', 'banana', 'cherry', 'date'} (4 items).
Step 2: Compute Jaccard similarity
Jaccard similarity = size of intersection ÷ size of union = 2 ÷ 4 = 0.5.
Final Answer:
0.5 -> Option D
Quick Check:
Jaccard = intersection/union = 0.5 [OK]
- Counting union incorrectly
- Using sum instead of division
- Confusing intersection with union size
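The same calculation as a reusable function, as a minimal sketch. The name `jaccard_similarity` and the choice to return 0.0 when both sets are empty are assumptions made here; some applications prefer 1.0 for that case.

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        # Both sets are empty; the ratio is undefined, so pick a convention.
        return 0.0
    return len(set_a & set_b) / len(union)

text1 = {'apple', 'banana', 'cherry'}
text2 = {'banana', 'cherry', 'date'}

score = jaccard_similarity(text1, text2)
print(score)  # 0.5
```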
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5])
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)
Solution
Step 1: Check vector sizes
Vector A has length 3 and vector B has length 2, so the dot product is invalid.
Step 2: Understand dot product requirements
The dot product requires vectors of the same length; with a mismatch, np.dot raises a ValueError before the norms are ever computed.
Final Answer:
Vectors A and B have different lengths causing dot product error -> Option B
Quick Check:
Dot product needs equal length vectors [OK]
- Assuming norm causes error
- Thinking division by zero happened
- Ignoring vector length mismatch
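One way to surface this problem early is to validate the inputs before doing any arithmetic. The sketch below uses plain Python rather than NumPy, and `safe_cosine` is a hypothetical helper name. It raises a `ValueError` with an explicit message for mismatched lengths and for all-zero vectors, which would otherwise cause a division by zero.

```python
import math

def safe_cosine(a, b):
    """Cosine similarity with explicit checks on the inputs."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

# Same vectors as the broken example: lengths 3 and 2.
try:
    safe_cosine([1, 2, 3], [4, 5])
    message = None
except ValueError as error:
    message = str(error)

print(message)  # length mismatch: 3 vs 2
```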
Solution
Step 1: Understand TF-IDF role
TF-IDF reduces the weight of words that are common across articles, highlighting the terms that distinguish each one.
Step 2: Why cosine similarity on TF-IDF helps
Cosine similarity measures the angle between vectors and ignores their magnitude, so a short article and a long one can still be compared fairly.
Final Answer:
Use cosine similarity on TF-IDF vectors to reduce common word impact -> Option A
Quick Check:
TF-IDF + cosine similarity = better relatedness [OK]
- Ignoring word importance by using raw counts
- Using Jaccard without preprocessing
- Relying on random scores
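To make the weighting concrete, here is a simplified from-scratch sketch. It assumes whitespace tokenization and the formula idf = ln(N / df) + 1, which is one common variant and not the exact default of scikit-learn's `TfidfVectorizer` (that one adds smoothing and normalization). The three article snippets and all helper names are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the team won the match",
    "the team lost the final",
    "the market fell sharply",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
vocabulary = sorted(doc_freq)

def idf(word):
    """Inverse document frequency: rarer words get larger weights."""
    return math.log(n_docs / doc_freq[word]) + 1

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Raw count vectors and their TF-IDF weighted counterparts.
raw = [[Counter(tokens)[w] for w in vocabulary] for tokens in tokenized]
tfidf = [[count * idf(w) for count, w in zip(row, vocabulary)] for row in raw]

# 'the' appears everywhere, so it gets the lowest weight.
print(idf("the"), idf("team"), idf("market"))

# The sports and finance snippets share only 'the'.
raw_unrelated = cosine(raw[0], raw[2])
tfidf_unrelated = cosine(tfidf[0], tfidf[2])
tfidf_related = cosine(tfidf[0], tfidf[1])
print(raw_unrelated, tfidf_unrelated, tfidf_related)
```

With raw counts, the shared word "the" alone makes the unrelated pair look fairly similar; TF-IDF weighting lowers that score while the two sports snippets remain the closest pair.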
