Why similarity measures find related text in NLP - Challenge Your Understanding

Challenge - 5 Problems
Conceptual
intermediate
Why do cosine similarity scores close to 1 indicate related text?

Cosine similarity is the cosine of the angle between two text vectors. Why does a score close to 1 mean the texts are related?

A. Because the vectors point in very similar directions, showing similar word usage patterns.
B. Because the vectors have very different lengths, indicating unrelated content.
C. Because the vectors are orthogonal, meaning they share no common words.
D. Because the vectors have zero magnitude, so similarity is undefined.
Hint

Think about what it means when two arrows point the same way.

Predict Output
intermediate
Output of cosine similarity between two text vectors

What is the output of the following code that computes cosine similarity between two text vectors?

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['apple orange banana', 'banana orange apple']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
sim = cosine_similarity(X[0], X[1])
print(round(sim[0][0], 2))
A. 0.0
B. 0.5
C. 1.0
D. 0.33
Hint

Both texts have the same words but in different order.

Model Choice
advanced
Best similarity measure for short text snippets

You want to find relatedness between very short texts like tweets. Which similarity measure is best?

A. Jaccard similarity on sets of words
B. Euclidean distance on raw word counts
C. Manhattan distance on character counts
D. Cosine similarity on TF-IDF vectors
Hint

Consider a measure that accounts for word importance and ignores length differences.

Hyperparameter
advanced
Effect of stopword removal on similarity scores

How does removing stopwords before vectorizing text affect similarity scores?

A. It makes similarity scores reflect shared meaningful words instead of shared filler words.
B. It decreases similarity scores by removing common words that link texts.
C. It has no effect because stopwords are ignored by similarity measures.
D. It causes errors because vectors become empty.
Hint

Think about what words carry meaning in text.

Debug
expert
Why does this similarity code produce zero similarity for related texts?

Given two related texts, this code outputs zero similarity. What is the cause?

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['cat and dog', 'dog and cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
sim = cosine_similarity(X[0], X[1])
print(sim[0][0])
A. The vectors are sparse matrices and need to be converted to dense arrays before similarity.
B. The code is correct and should output 1.0; zero means an environment error.
C. The cosine_similarity function expects 1D arrays, but gets 2D sparse matrices causing zero output.
D. The CountVectorizer default token pattern excludes all words, resulting in empty vectors.
Hint

Check if the code runs as expected in a normal Python environment.

Practice

1. Why do similarity measures help find related text in NLP?
easy
A. Because they compare numeric representations of texts to find closeness
B. Because they translate text into images for comparison
C. Because they count the number of words in each text
D. Because they randomly select texts to compare

Solution

  1. Step 1: Understand text representation in NLP

    Texts are converted into numbers (vectors) so computers can compare them easily.
  2. Step 2: Role of similarity measures

    Similarity measures calculate how close these numeric vectors are, showing relatedness.
  3. Final Answer:

    Because they compare numeric representations of texts to find closeness -> Option A
  4. Quick Check:

    Similarity = Numeric comparison [OK]
Hint: Similarity means comparing numbers, not words directly [OK]
Common Mistakes:
  • Thinking similarity compares raw words directly
  • Confusing similarity with random selection
  • Believing similarity translates text into images
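
The two solution steps above (turn texts into vectors, then measure how close the vectors are) can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the helper names bag_of_words and cosine, the whitespace tokenizer, and the three sample sentences are all assumptions made for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Step 1: represent a text as word counts (a sparse numeric vector)."""
    return Counter(text.lower().split())

def cosine(c1, c2):
    """Step 2: compare two count vectors with cosine similarity."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat an empty text as unrelated
    return dot / (norm1 * norm2)

query = bag_of_words("the cat sat on the mat")
related = bag_of_words("a cat sat on a mat")
unrelated = bag_of_words("stock prices fell sharply today")

related_score = cosine(query, related)
unrelated_score = cosine(query, unrelated)
print(round(related_score, 2), round(unrelated_score, 2))  # 0.5 0.0
```

The related pair shares four words and scores 0.5, while the pair with no words in common scores 0.0, which is the numeric notion of closeness the answer refers to.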
2. Which of the following is the correct way to calculate cosine similarity between two vectors A and B in Python?
easy
A. cos_sim = np.linalg.norm(A - B)
B. cos_sim = np.sum(A + B)
C. cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
D. cos_sim = np.dot(A, B) * (np.linalg.norm(A) + np.linalg.norm(B))

Solution

  1. Step 1: Recall cosine similarity formula

    Cosine similarity = dot product of vectors divided by product of their lengths.
  2. Step 2: Match formula to code

    cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) matches this formula exactly using numpy functions.
  3. Final Answer:

    cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option C
  4. Quick Check:

    np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) is the cosine similarity formula [OK]
Hint: Cosine similarity = dot product ÷ product of norms [OK]
Common Mistakes:
  • Adding vectors instead of dot product
  • Multiplying dot product by sum of norms
  • Using norm of difference instead of cosine similarity
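
To see why the norm of the difference (option A) is not a substitute, here is a small sketch that computes both quantities for two vectors pointing in the same direction. It uses plain Python lists instead of NumPy so it runs without extra packages; the function names and the sample vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Length of the difference vector (what option A computes)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A = [1.0, 2.0, 3.0]
B = [2.0, 4.0, 6.0]  # same direction as A, twice as long

cos_ab = cosine_similarity(A, B)
dist_ab = euclidean_distance(A, B)
print(round(cos_ab, 6), round(dist_ab, 3))  # 1.0 3.742
```

Cosine similarity reports a perfect match because only direction matters, whereas the Euclidean distance is well above zero because B is longer than A.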
3. Given two texts converted to sets of words: text1 = {'apple', 'banana', 'cherry'} and text2 = {'banana', 'cherry', 'date'}, what is the Jaccard similarity between them?
medium
A. 0.25
B. 0.6
C. 0.75
D. 0.5

Solution

  1. Step 1: Calculate intersection and union of sets

    Intersection = {'banana', 'cherry'} (2 items), Union = {'apple', 'banana', 'cherry', 'date'} (4 items).
  2. Step 2: Compute Jaccard similarity

    Jaccard similarity = size of intersection ÷ size of union = 2 ÷ 4 = 0.5.
  3. Final Answer:

    0.5 -> Option D
  4. Quick Check:

    Jaccard = intersection/union = 0.5 [OK]
Hint: Jaccard = common words ÷ total unique words [OK]
Common Mistakes:
  • Counting union incorrectly
  • Using sum instead of division
  • Confusing intersection with union size
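
The calculation in the solution maps directly onto Python's set operators. The sketch below reuses the two sets from the question; the function name jaccard and the choice to return 0.0 for two empty sets are assumptions for the example.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as unrelated
    return len(a & b) / len(a | b)

text1 = {'apple', 'banana', 'cherry'}
text2 = {'banana', 'cherry', 'date'}

score = jaccard(text1, text2)
print(score)  # 0.5
```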
4. The following Python code tries to compute cosine similarity but gives an error. What is the main issue?
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5])
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)
medium
A. np.linalg.norm is used incorrectly
B. Vectors A and B have different lengths causing dot product error
C. Division by zero error
D. Missing import statement for numpy

Solution

  1. Step 1: Check vector sizes

    Vector A has length 3 and vector B has length 2, so np.dot(A, B) raises a ValueError before the division is ever reached.
  2. Step 2: Understand dot product requirements

    Dot product requires vectors of same length; mismatch causes error.
  3. Final Answer:

    Vectors A and B have different lengths causing dot product error -> Option B
  4. Quick Check:

    Dot product needs equal length vectors [OK]
Hint: Dot product needs vectors of same length [OK]
Common Mistakes:
  • Assuming norm causes error
  • Thinking division by zero happened
  • Ignoring vector length mismatch
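
One way to surface this kind of bug early is to check the lengths before computing anything. The following is a sketch in plain Python rather than NumPy; safe_cosine is a hypothetical helper, and padding B with a zero is only an illustrative repair. In real text pipelines both vectors should come from the same vocabulary so that they already have equal length.

```python
import math

def safe_cosine(a, b):
    """Cosine similarity that fails loudly on mismatched or zero vectors."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# The vectors from the question: the check reports the problem clearly.
try:
    safe_cosine([1, 2, 3], [4, 5])
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
print(mismatch_message)  # length mismatch: 3 vs 2

# Illustrative repair: give B a third component so both have length 3.
fixed = safe_cosine([1, 2, 3], [4, 5, 0])
print(round(fixed, 3))
```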
5. You want to find related news articles using similarity measures. Which approach best improves accuracy when the articles differ in length and share many common words?
hard
A. Use cosine similarity on TF-IDF vectors to reduce common word impact
B. Use raw word counts and Jaccard similarity without preprocessing
C. Compare articles by counting total words only
D. Use random similarity scores to guess relatedness

Solution

  1. Step 1: Understand TF-IDF role

    TF-IDF reduces the weight of words that appear in many articles, highlighting the terms that distinguish each one.
  2. Step 2: Why cosine similarity on TF-IDF helps

    Cosine similarity measures the angle between vectors rather than their magnitude, so articles of different lengths can still score as similar.
  3. Final Answer:

    Use cosine similarity on TF-IDF vectors to reduce common word impact -> Option A
  4. Quick Check:

    TF-IDF + cosine similarity = better relatedness [OK]
Hint: TF-IDF + cosine similarity handles length and common words best [OK]
Common Mistakes:
  • Ignoring word importance by using raw counts
  • Using Jaccard without preprocessing
  • Relying on random scores
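
The effect described in the solution can be sketched with a hand-rolled TF-IDF. This is a simplified textbook variant (weight = count × log(N / document frequency)), chosen so that a word appearing in every article gets weight zero; libraries such as scikit-learn use smoothed formulas that give different numbers. The function names and the three sample articles are assumptions for the example.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Raw word counts per document."""
    return [Counter(doc.lower().split()) for doc in docs]

def tfidf_vectors(docs):
    """Simplified TF-IDF: count * log(N / document frequency)."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts a word once
    return [{w: n * math.log(n_docs / df[w]) for w, n in c.items()}
            for c in counts]

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

articles = [
    "the election results were announced the vote count was close",
    "the vote count in the election was very close and the results were disputed in the courts",
    "the team won the match and the fans were happy",
]

raw = count_vectors(articles)
weighted = tfidf_vectors(articles)

raw_unrelated = cosine(raw[0], raw[2])          # share only "the" and "were"
tfidf_unrelated = cosine(weighted[0], weighted[2])
tfidf_related = cosine(weighted[0], weighted[1])  # short vs. long, same topic
print(round(raw_unrelated, 2), round(tfidf_unrelated, 2), round(tfidf_related, 2))
```

With raw counts, the election article and the sports article look somewhat similar purely because both contain "the" and "were". After TF-IDF weighting those words contribute nothing, while the short and long election articles still score as related despite their different lengths.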