Complete the code to compute cosine similarity between two vectors.
from sklearn.metrics.pairwise import [1]

vec1 = [[1, 2, 3]]
vec2 = [[4, 5, 6]]
similarity = [1](vec1, vec2)
print(similarity)
The cosine_similarity function from sklearn.metrics.pairwise calculates the cosine similarity between vectors, which is commonly used to measure document similarity.
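A runnable version with the blank filled in as cosine_similarity (per the explanation above); note that the function expects 2-D inputs, one row per vector:

```python
from sklearn.metrics.pairwise import cosine_similarity

# 2-D inputs: each row is one vector
vec1 = [[1, 2, 3]]
vec2 = [[4, 5, 6]]

# cosine similarity = (v1 . v2) / (||v1|| * ||v2||)
similarity = cosine_similarity(vec1, vec2)
print(similarity)  # [[0.9746...]]
```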
Complete the code to convert text documents into TF-IDF vectors.
from sklearn.feature_extraction.text import [1]

corpus = ['I love machine learning', 'Machine learning is fun']
tfidf_vectorizer = [1]()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
TfidfVectorizer converts text documents into TF-IDF feature vectors, which reflect the importance of words in documents.
Fix the error in the code to correctly compute similarity scores between documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import [1]

texts = ['Data science is cool', 'I love data science']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
scores = [1](matrix)
print(scores)
To get similarity scores, cosine_similarity is the correct function: distance functions such as cosine_distances measure dissimilarity (1 − similarity), so using one would invert the ranking.
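A runnable version with the blank resolved to cosine_similarity; called with a single matrix, it scores every document against every other:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['Data science is cool', 'I love data science']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)

# cosine_similarity accepts the sparse TF-IDF matrix directly and
# returns a symmetric matrix with 1.0 on the diagonal
scores = cosine_similarity(matrix)
print(scores)
```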
Fill both blanks to create a dictionary of document similarity scores above a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['AI is the future', 'AI and ML are related', 'I enjoy sports']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
sim_matrix = cosine_similarity(matrix)
threshold = 0.5
similar_docs = {i: [j for j in range(len(texts)) if sim_matrix[i][j] [1] threshold and i != j]
                for i in range(len(texts))
                if any(sim_matrix[i][j] [2] threshold for j in range(len(texts)))}
print(similar_docs)
The code filters pairs with similarity scores strictly greater than the threshold, so both blanks use the '>' operator.
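A runnable sketch with both blanks filled in as '>'. One caveat worth noting: as written in the exercise, the any(...) guard also matches a document's self-similarity of 1.0, so every document gets a key; the version below adds i != j to the guard as well, which is a small assumed fix rather than part of the original exercise:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['AI is the future', 'AI and ML are related', 'I enjoy sports']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
sim_matrix = cosine_similarity(matrix)

threshold = 0.5
# i != j appears in the guard too; otherwise every document qualifies
# via its self-similarity of 1.0 and gets an (often empty) entry
similar_docs = {i: [j for j in range(len(texts)) if sim_matrix[i][j] > threshold and i != j]
                for i in range(len(texts))
                if any(sim_matrix[i][j] > threshold and i != j for j in range(len(texts)))}
print(similar_docs)
```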
Fill all three blanks to rank documents by similarity to a query document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['Deep learning is powerful', 'I like deep learning', 'Cats are cute']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)
query = ['I love learning']
query_vec = vectorizer.[1](query)
sim_scores = [2](query_vec, matrix)[0]
ranked_docs = sorted(((i, sim_scores[i]) for i in range(len(corpus))), key=lambda x: x[1], reverse=[3])
print(ranked_docs)
transform (rather than fit_transform) encodes the query with the vocabulary already learned from the corpus; cosine_similarity scores the query against every document; and sorting by the score x[1] with reverse=True puts the most similar documents first. (Sorting by -x[1] without reverse=True would be equivalent, but combining both would sort ascending.)
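A runnable sketch of the completed exercise. With this corpus, the query 'I love learning' only overlaps the vocabulary on 'learning', so the two deep-learning documents rank first and 'Cats are cute' scores zero:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['Deep learning is powerful', 'I like deep learning', 'Cats are cute']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

query = ['I love learning']
# transform (not fit_transform) reuses the vocabulary learned from the corpus
query_vec = vectorizer.transform(query)
sim_scores = cosine_similarity(query_vec, matrix)[0]

# Sort (index, score) pairs by score, highest first
ranked_docs = sorted(((i, sim_scores[i]) for i in range(len(corpus))),
                     key=lambda x: x[1], reverse=True)
print(ranked_docs)
```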