NLP · ~20 mins

Why similarity measures find related text in NLP - Challenge Your Understanding

Challenge - 5 Problems
🎖️ Similarity Master: answer all five challenges correctly to earn this badge. Test your skills under time pressure!
🧠 Conceptual · intermediate · 2:00 limit
Why do cosine similarity scores close to 1 indicate related text?

Cosine similarity measures the angle between two text vectors. Why does a score close to 1 mean the texts are related?

A. Because the vectors point in very similar directions, showing similar word usage patterns.
B. Because the vectors have very different lengths, indicating unrelated content.
C. Because the vectors are orthogonal, meaning they share no common words.
D. Because the vectors have zero magnitude, so similarity is undefined.
💡 Hint

Think about what it means when two arrows point the same way.
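To see the geometry behind the hint, here is a minimal sketch (assuming NumPy is available; the vectors are made up for illustration) that computes cosine similarity directly from the angle formula:

```python
import numpy as np

def cos_sim(a, b):
    # cosine of the angle between a and b: dot product over product of lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two arrows pointing the same way (one is just a scaled copy of the other)
aligned = cos_sim(np.array([1, 2, 3]), np.array([2, 4, 6]))

# Two arrows at right angles (no shared components at all)
orthogonal = cos_sim(np.array([1, 0]), np.array([0, 1]))

print(round(aligned, 2))   # 1.0
print(round(orthogonal, 2))  # 0.0
```

Scaling a vector changes its length but not its direction, which is why cosine similarity ignores document length and only rewards matching word-usage directions.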

Predict Output · intermediate · 2:00 limit
Output of cosine similarity between two text vectors

What is the output of the following code that computes cosine similarity between two text vectors?

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['apple orange banana', 'banana orange apple']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
sim = cosine_similarity(X[0], X[1])
print(round(sim[0][0], 2))
A. 0.0
B. 0.5
C. 1.0
D. 0.33
💡 Hint

Both texts have the same words but in different order.

Model Choice · advanced · 2:00 limit
Best similarity measure for short text snippets

You want to find relatedness between very short texts like tweets. Which similarity measure is best?

A. Jaccard similarity on sets of words
B. Euclidean distance on raw word counts
C. Manhattan distance on character counts
D. Cosine similarity on TF-IDF vectors
💡 Hint

Consider a measure that accounts for word importance and ignores length differences.
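To experiment with the hint, here is a small sketch (assuming scikit-learn is available; the tweet-length texts are invented) that compares TF-IDF cosine similarity across related and unrelated short texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    'great movie tonight',            # related to the next tweet
    'tonight was a great movie',      # same topic, different length/order
    'stock prices fell',              # unrelated topic
]

# TF-IDF weights down-rank ubiquitous words; cosine ignores length differences
X = TfidfVectorizer().fit_transform(tweets)
sim = cosine_similarity(X)

# Related tweets score higher than unrelated ones
print(round(sim[0][1], 2) > round(sim[0][2], 2))  # True
```

Because cosine similarity normalizes by vector length, a three-word tweet and a five-word tweet on the same topic can still score highly, which matters for very short texts.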

Hyperparameter · advanced · 2:00 limit
Effect of stopword removal on similarity scores

How does removing stopwords before vectorizing text affect similarity scores?

A. It increases similarity scores by focusing on meaningful words.
B. It decreases similarity scores by removing common words that link texts.
C. It has no effect because stopwords are ignored by similarity measures.
D. It causes errors because vectors become empty.
💡 Hint

Think about what words carry meaning in text.
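You can probe the effect yourself with a quick sketch (assuming scikit-learn is available; the example sentences are invented), comparing scores with and without scikit-learn's built-in English stopword list:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['the cat sat on the mat', 'the dog ran in the park']

# Keeping stopwords: the shared function words ('the', 'on', 'in')
# create overlap between otherwise unrelated sentences
with_stop = cosine_similarity(CountVectorizer().fit_transform(texts))[0][1]

# Removing stopwords: only content words remain, and these two
# sentences share none of them
no_stop = cosine_similarity(
    CountVectorizer(stop_words='english').fit_transform(texts))[0][1]

print(with_stop > no_stop)  # True
```

Running this shows how function words can inflate scores between unrelated texts, so the score after removal reflects overlap in meaning-carrying words only.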

🔧 Debug · expert · 2:00 limit
Why does this similarity code produce zero similarity for related texts?

Given two related texts, this code outputs zero similarity. What is the cause?

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['cat and dog', 'dog and cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
sim = cosine_similarity(X[0], X[1])
print(sim[0][0])
A. The vectors are sparse matrices and need to be converted to dense arrays before similarity.
B. The code is correct and should output 1.0; zero means an environment error.
C. The cosine_similarity function expects 1D arrays, but gets 2D sparse matrices, causing zero output.
D. The CountVectorizer default token pattern excludes all words, resulting in empty vectors.
💡 Hint

Check if the code runs as expected in a normal Python environment.