Document similarity ranking in NLP - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank (easy)

Complete the code to compute cosine similarity between two vectors.

from sklearn.metrics.pairwise import [1]

vec1 = [[1, 2, 3]]
vec2 = [[4, 5, 6]]
similarity = [1](vec1, vec2)
print(similarity)
Options:
A. cosine_similarity
B. euclidean_distance
C. manhattan_distance
D. dot_product
Common Mistakes
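Using a distance function instead of a similarity function. Distances measure how far apart two vectors are, and scikit-learn's own distance helpers are named euclidean_distances and manhattan_distances.
Trying to use dot_product, which is not a scikit-learn function.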
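
As a rough cross-check for this task, the sketch below computes cosine similarity by hand (dot product divided by the product of the vector norms) and compares it with scikit-learn's result. The names `manual` and `library` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vec1 = np.array([[1, 2, 3]])
vec2 = np.array([[4, 5, 6]])

# Cosine similarity by hand: dot product divided by the product of the norms.
manual = (vec1 @ vec2.T) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# scikit-learn returns a 2-D array with one row per row of vec1
# and one column per row of vec2, so here the shape is (1, 1).
library = cosine_similarity(vec1, vec2)

print(manual)
print(library)
```

Both values should agree at roughly 0.97, which reflects that the two vectors point in nearly the same direction.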
2. Fill in the blank (medium)

Complete the code to convert text documents into TF-IDF vectors.

from sklearn.feature_extraction.text import [1]

corpus = ['I love machine learning', 'Machine learning is fun']
tfidf_vectorizer = [1]()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
Options:
A. HashingVectorizer
B. TfidfVectorizer
C. CountVectorizer
D. DictVectorizer
Common Mistakes
Using CountVectorizer, which only counts word occurrences and applies no IDF weighting.
Using HashingVectorizer, which does not compute TF-IDF.
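
To make the difference between raw counts and TF-IDF concrete, here is a small sketch that assumes default settings throughout. It builds the counts first, reweights them with TfidfTransformer, and compares the result with what TfidfVectorizer produces in a single step. The variable names `counts`, `reweighted`, and `direct` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = ['I love machine learning', 'Machine learning is fun']

# Step 1: raw term counts (integers, no weighting).
counts = CountVectorizer().fit_transform(corpus)

# Step 2: reweight the counts by inverse document frequency and L2-normalize rows.
reweighted = TfidfTransformer().fit_transform(counts)

# Both steps at once.
direct = TfidfVectorizer().fit_transform(corpus)

print(counts.toarray())
print(direct.toarray())
print(np.allclose(reweighted.toarray(), direct.toarray()))
```

With default parameters the two routes should give the same matrix, so the one-step vectorizer is mainly a convenience.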
3. Fill in the blank (hard)

Complete the code so that it correctly computes pairwise similarity scores between the documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import [1]

texts = ['Data science is cool', 'I love data science']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
scores = [1](matrix)
print(scores)
Options:
A. manhattan_distances
B. euclidean_distances
C. pairwise_distances
D. cosine_similarity
Common Mistakes
Using distance functions, which give dissimilarity scores (larger means less alike).
Passing a 1-D vector instead of a 2-D matrix; the pairwise functions expect input of shape (n_samples, n_features).
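
The sketch below illustrates why a distance cannot simply stand in for a similarity. It assumes TfidfVectorizer's default L2 row normalization, under which the squared Euclidean distance and the cosine similarity are related by d^2 = 2 - 2 * cos, so the two scores move in opposite directions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

texts = ['Data science is cool', 'I love data science']
matrix = TfidfVectorizer().fit_transform(texts)

# Calling with a single argument compares every row with every other row.
sim = cosine_similarity(matrix)      # shape (2, 2), diagonal close to 1
dist = euclidean_distances(matrix)   # shape (2, 2), diagonal close to 0

print(sim)
print(dist)

# For unit-length rows: squared distance = 2 - 2 * cosine similarity.
print(np.allclose(dist ** 2, 2 - 2 * sim, atol=1e-6))
```

A document compared with itself scores highest on similarity and lowest on distance, which is why ranking code written for one breaks when fed the other.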
4. Fill in the blank (hard)

Fill both blanks to create a dictionary of document similarity scores above a threshold.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['AI is the future', 'AI and ML are related', 'I enjoy sports']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
sim_matrix = cosine_similarity(matrix)

threshold = 0.5
similar_docs = {i: [j for j in range(len(texts)) if sim_matrix[i][j] [1] threshold and i != j] for i in range(len(texts)) if any(sim_matrix[i][j] [2] threshold for j in range(len(texts)) if i != j)}
print(similar_docs)
Options:
A. >
B. >=
C. <
D. <=
Common Mistakes
Using '<' or '<=', which would select the less similar documents.
Using different operators in the two blanks, so the outer filter and the inner list disagree.
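
The nested comprehension in this task is dense, so here is a more explicit sketch of the same idea. The helper name `similar_above` is hypothetical, and the strict `>` comparison is just one possible choice. It skips self-matches, because every document has similarity 1.0 with itself and would otherwise always pass the threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similar_above(sim_matrix, threshold):
    """Map each document index to the other documents scoring above threshold."""
    result = {}
    n = sim_matrix.shape[0]
    for i in range(n):
        # Exclude j == i: the diagonal is the document compared with itself.
        matches = [j for j in range(n) if j != i and sim_matrix[i, j] > threshold]
        if matches:  # keep only documents that have at least one match
            result[i] = matches
    return result


texts = ['AI is the future', 'AI and ML are related', 'I enjoy sports']
sim_matrix = cosine_similarity(TfidfVectorizer().fit_transform(texts))

print(sim_matrix.round(2))
print(similar_above(sim_matrix, 0.5))
print(similar_above(sim_matrix, 0.1))
```

On a corpus this small the first two documents share only one term, so it is worth printing the matrix before settling on a threshold.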
5. Fill in the blank (hard)

Fill all three blanks to rank documents by similarity to a query document.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['Deep learning is powerful', 'I like deep learning', 'Cats are cute']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

query = ['I love learning']
query_vec = vectorizer.transform(query)
sim_scores = cosine_similarity(query_vec, matrix)[0]

ranked_docs = sorted(((i, sim_scores[i]) for i in range(len(corpus))), key=lambda x: x[1] x[2], reverse=[3])
print(ranked_docs)
Options:
A. *
B. -
C. >
D. True
Common Mistakes
Using '*' or '+' in the key, which does not sort properly.
Setting reverse=False, which sorts in ascending order and puts the least similar document first.
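
For reference, the following is one way the ranking step could be written end to end. It is a sketch and not necessarily the form the exercise expects. It sorts (index, score) pairs by score in descending order, then shows an equivalent ordering using NumPy's argsort on the negated scores. The name `order` is illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['Deep learning is powerful', 'I like deep learning', 'Cats are cute']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# transform (not fit_transform) so the query uses the corpus vocabulary;
# query words never seen in the corpus are ignored.
query_vec = vectorizer.transform(['I love learning'])
sim_scores = cosine_similarity(query_vec, matrix)[0]

# Sort (index, score) pairs by score, highest first.
ranked_docs = sorted(enumerate(sim_scores), key=lambda pair: pair[1], reverse=True)

# Equivalent ordering of indices with NumPy: negate to sort descending.
order = np.argsort(-sim_scores)

for i, score in ranked_docs:
    print(f'{score:.3f}  {corpus[i]}')
```

The shorter document containing "learning" should rank first here, since its TF-IDF weight is spread over fewer terms.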