Document similarity ranking helps find how close or related two texts are. It helps organize and find important documents quickly.
Document similarity ranking in NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# List of documents
documents = ["text1", "text2", "text3"]

# Convert documents to vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Compute similarity matrix
similarity_matrix = cosine_similarity(doc_vectors)
TfidfVectorizer converts text into numbers that show importance of words.
cosine_similarity measures how close two vectors are. For TF-IDF vectors, which have no negative values, the score ranges from 0 (no shared words) to 1 (same direction).
documents = ["I love apples", "I like oranges", "Apples and oranges are fruits"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
similarity_matrix = cosine_similarity(doc_vectors)
print(similarity_matrix)
query = "I enjoy apples"
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, doc_vectors)
print(scores)
This program converts four documents into number vectors and calculates how similar each document is to the others. It prints a matrix showing similarity scores between 0 and 1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "Machine learning is fun",
    "Artificial intelligence and machine learning",
    "I love reading about AI",
    "The sky is blue and beautiful"
]

# Convert documents to TF-IDF vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Compute similarity matrix
similarity_matrix = cosine_similarity(doc_vectors)

# Print similarity matrix rounded to 2 decimals
# (float() keeps the printed list free of NumPy scalar wrappers)
for i, row in enumerate(similarity_matrix):
    print(f"Document {i} similarities:", [round(float(score), 2) for score in row])
Higher similarity scores mean documents are more alike.
TF-IDF helps reduce the effect of common words like 'the' or 'is'.
Cosine similarity works well for text because it focuses on the angle between vectors, not their length.
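The point about angle versus length can be illustrated with a few hand-made vectors. This is a minimal sketch: the vectors a, b, and c are made-up word weights, not output from the lesson's documents.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical word-weight vectors; b points the same way as a but is 10x longer.
a = np.array([[1.0, 2.0, 0.0]])
b = 10 * a
c = np.array([[0.0, 0.0, 3.0]])  # shares no non-zero dimension with a

same_direction = cosine_similarity(a, b)[0, 0]  # angle is zero, so the score is 1
no_overlap = cosine_similarity(a, c)[0, 0]      # vectors are orthogonal, so the score is 0
euclidean_gap = np.linalg.norm(a - b)           # distance is large despite the same direction

print(same_direction, no_overlap, euclidean_gap)
```

In text terms, a long document that repeats the wording of a short one still scores as highly similar, whereas a distance-based measure would treat them as far apart.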
Document similarity ranking helps find related texts by comparing their content.
Use TF-IDF to turn text into numbers that show word importance.
Cosine similarity measures how close two documents are; for TF-IDF vectors the score falls between 0 and 1.
Practice
Solution
Step 1: Understand the purpose of document similarity ranking
Document similarity ranking is used to compare texts and find how closely related they are based on their content.
Step 2: Identify the correct description
Among the options, only finding relatedness of texts matches the purpose of document similarity ranking.
Final Answer:
Find how related two texts are based on their content -> Option A
Quick Check:
Document similarity ranking = Find related texts [OK]
- Confusing similarity ranking with translation
- Thinking it summarizes documents
- Mixing it up with spell checking
A and B in Python using NumPy?
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = dot product of vectors divided by product of their magnitudes (norms).
Step 2: Match formula to code
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) correctly implements this formula using np.dot and np.linalg.norm.
Final Answer:
np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) -> Option B
Quick Check:
Cosine similarity formula = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)) [OK]
- Adding norms instead of multiplying
- Subtracting norms instead of dividing
- Multiplying dot product by sum of norms
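The formula from this solution can be wrapped in a small helper. This is a sketch rather than a reference implementation: the name cosine_sim and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)  # product of the magnitudes
    if denom == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return float(np.dot(a, b) / denom)

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])  # same direction as A, twice the length
print(cosine_sim(A, B))
```

The zero-vector guard avoids a division by zero, which the bare one-line formula does not handle.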
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['apple orange banana', 'banana fruit apple']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
sim_score = cosine_similarity(X[0], X[1])[0][0]
print(round(sim_score, 2))
Solution
Step 1: Understand TF-IDF vectorization of similar documents
Both documents share the words 'apple' and 'banana', so their TF-IDF vectors point in a similar direction. Each also contains one word the other lacks ('orange' and 'fruit'), and those rarer words carry a higher weight.
Step 2: Calculate cosine similarity between vectors
Cosine similarity between these vectors is well above 0 but less than 1. With scikit-learn's default TF-IDF settings it comes to approximately 0.5 after rounding.
Final Answer:
0.5 -> Option C
Quick Check:
Similarity of similar docs ≈ 0.5 [OK]
- Assuming similarity is exactly 1 for similar texts
- Confusing cosine similarity with Euclidean distance
- Ignoring TF-IDF weighting effects
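The score for this exercise can be checked by hand without scikit-learn. The sketch below assumes the library's documented defaults (smoothed IDF, raw term counts, and whitespace-separated lowercase words); the helper names are hypothetical. Because cosine similarity ignores vector length, the L2 normalization step that TfidfVectorizer applies is left out.

```python
import math

docs = ["apple orange banana", "banana fruit apple"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

def smoothed_idf(word):
    # Assumed formula (scikit-learn default, smooth_idf=True):
    # ln((1 + n) / (1 + df)) + 1
    df = sum(word in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(tokens):
    # Term count times IDF for every vocabulary word, in a fixed order.
    return [tokens.count(word) * smoothed_idf(word) for word in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

v0, v1 = (tfidf_vector(tokens) for tokens in tokenized)
manual_score = cosine(v0, v1)
print(round(manual_score, 2))
```

Only the two shared words contribute to the dot product, while the unshared words enlarge both norms, which is why the score stays below 1.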
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['cat dog', 'dog mouse']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
sim_score = cosine_similarity(X[0], X[1])
print(sim_score)
Solution
Step 1: Check input types for cosine_similarity
cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays (shape (n_features,)).
Step 2: Understand how to fix the error
Use X[0:1] and X[1:2] or reshape them properly to avoid the error.
Final Answer:
cosine_similarity expects 2D arrays, but X[0] and X[1] are 1D arrays -> Option A
Quick Check:
cosine_similarity input shape = 2D arrays [OK]
- Thinking TfidfVectorizer fails on different words
- Thinking cosine_similarity accepts 1D arrays
- Overlooking variable name typos
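The two fixes named in Step 2 might look like the following sketch. The variable names score_from_slices and score_from_reshape are chosen here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['cat dog', 'dog mouse']
X = TfidfVectorizer().fit_transform(docs).toarray()

# Slicing keeps the row dimension: X[0:1] has shape (1, n_features).
score_from_slices = cosine_similarity(X[0:1], X[1:2])[0, 0]

# reshape(1, -1) turns a 1D row into a single-row 2D array.
score_from_reshape = cosine_similarity(
    X[0].reshape(1, -1), X[1].reshape(1, -1)
)[0, 0]

print(round(score_from_slices, 2), round(score_from_reshape, 2))
```

Both approaches pass the same single-row 2D arrays to cosine_similarity, so they should produce the same score.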
Solution
Step 1: Understand ranking by similarity
To rank documents by similarity to a query, compute vector representations and measure similarity scores, then sort descending (highest similarity first).
Step 2: Identify correct method
TF-IDF vectors and cosine similarity are standard; ranking by descending cosine similarity scores is correct.
Final Answer:
Compute TF-IDF vectors for all documents and query, then rank by cosine similarity scores descending -> Option D
Quick Check:
Similarity ranking = cosine similarity descending [OK]
- Ranking by ascending similarity (lowest first)
- Using raw counts without weighting
- Ranking by overlap count ascending
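The ranking procedure from this solution can be sketched end to end as follows. The corpus and query are hypothetical, and using np.argsort to order the scores is one possible choice among several.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus and query, chosen only for illustration.
corpus = [
    "The sky is blue and beautiful",
    "Machine learning is fun",
    "Artificial intelligence and machine learning",
]
query = "machine learning basics"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)  # learn vocabulary and IDF from the corpus
query_vector = vectorizer.transform([query])    # reuse the same vocabulary for the query

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = np.argsort(scores)[::-1]  # document indices, highest score first

for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx]:.2f}  {corpus[idx]}")
```

Note that the query is passed through transform rather than fit_transform, so it is scored against the vocabulary learned from the corpus; query words the corpus never saw are simply ignored.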
