
Document similarity ranking in NLP - ML Experiment: Train & Evaluate


Experiment - Document similarity ranking

Problem: We want to rank a list of documents by how similar they are to a given query document. The current model uses TF-IDF vectors with cosine similarity, but its rankings are not very accurate.
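The baseline described above can be sketched as follows. This is a minimal illustration on a toy corpus (the three documents and the query are assumptions for the example), not necessarily the exact baseline that produced the reported metrics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, assumed for illustration only
documents = [
    "The cat sat on the mat.",
    "Dogs are great pets.",
    "Cats and dogs can live together.",
]
query = "I love my pet cat."

# Fit the vocabulary and IDF weights on the documents, then reuse them for the query
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# One similarity score per document; higher means more similar
scores = cosine_similarity(query_vector, doc_vectors).flatten()

# Document indices ordered from most to least similar
ranking = scores.argsort()[::-1]
for rank, idx in enumerate(ranking, 1):
    print(f"{rank}. {documents[idx]} (score={scores[idx]:.3f})")
```

Only the first document shares an exact token ("cat") with the query, so the other two score zero even though they mention cats and pets. TF-IDF treats "cat" and "cats", or "pet" and "pets", as unrelated terms, which is one reason this kind of baseline can miss relevant documents.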

Current Metrics: Mean Reciprocal Rank (MRR): 0.55, Precision@3: 0.50
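For reference, both metrics can be computed from a ranked list and a set of relevant document ids per query. The sketch below uses hypothetical helper names (`reciprocal_rank`, `precision_at_k`, `evaluate`) and made-up relevance judgments; the page does not show how its own numbers were computed.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(doc_id in relevant_ids for doc_id in top_k) / k


def evaluate(rankings, relevance, k=3):
    """Average both metrics over all queries."""
    n = len(rankings)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevance)) / n
    p_at_k = sum(precision_at_k(r, rel, k) for r, rel in zip(rankings, relevance)) / n
    return mrr, p_at_k


# Hypothetical rankings and relevance judgments for two queries
rankings = [[2, 0, 4, 1, 3], [1, 3, 0, 2, 4]]
relevance = [{0, 4}, {1}]

mrr, p3 = evaluate(rankings, relevance)
print(f"MRR={mrr:.3f}, Precision@3={p3:.3f}")  # MRR=0.750, Precision@3=0.500
```

In the first query the first relevant document appears at rank 2 (reciprocal rank 0.5) and two of the top three are relevant; in the second query the only relevant document is at rank 1.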

Issue: The current similarity ranking is not precise enough: relevant documents are missing from the top results.

Your Task

Improve the document similarity ranking so that Mean Reciprocal Rank (MRR) is above 0.70 and Precision@3 is above 0.65.

You must keep using vector-based similarity methods.

You cannot use pretrained large language models.

You can change vectorization method and similarity metric.
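One option that stays within these constraints is latent semantic analysis: TF-IDF vectors reduced with a truncated SVD, trained only on the corpus itself. The sketch below is an illustrative alternative, not the solution this page uses; the corpus, the query and `n_components=2` are assumptions chosen for the toy example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Toy corpus, assumed for illustration only
documents = [
    "The cat sat on the mat.",
    "Dogs are great pets.",
    "Cats and dogs can live together.",
    "The quick brown fox jumps over the lazy dog.",
    "Pets like cats and dogs are common.",
]
query = "I love my pet cat."

# TF-IDF followed by a low-rank projection; no pretrained model is involved
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)
doc_vectors = lsa.fit_transform(documents)   # shape: (n_docs, 2)
query_vector = lsa.transform([query])        # shape: (1, 2)

# Rank documents by cosine similarity in the reduced space
scores = cosine_similarity(query_vector, doc_vectors).flatten()
ranking = scores.argsort()[::-1]
print(ranking)
```

On a corpus this small the projection is not meaningful; whether it helps on a real collection has to be checked by measuring MRR and Precision@3 against relevance judgments.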


Solution


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Sample documents and query
documents = [
    "The cat sat on the mat.",
    "Dogs are great pets.",
    "Cats and dogs can live together.",
    "The quick brown fox jumps over the lazy dog.",
    "Pets like cats and dogs are common."
]
query = "I love my pet cat."

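# Simple preprocessing: lowercase and strip punctuation so "cat." matches "cat"
def preprocess(text):
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

# Simulated word embeddings: random vectors stand in for pretrained GloVe vectors.
# Random vectors carry no semantic information, so the ranking printed below only
# demonstrates the pipeline. In a real case, load actual embeddings from file.
embedding_dim = 50
rng = np.random.default_rng(0)  # fixed seed so the demo is reproducible
vocabulary = sorted({w for text in documents + [query] for w in preprocess(text).split()})
word_to_vec = {word: rng.random(embedding_dim) for word in vocabulary}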

# Function to get average embedding for a document
def document_embedding(doc):
    words = preprocess(doc).split()
    vectors = [word_to_vec[w] for w in words if w in word_to_vec]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(embedding_dim)

# Compute embeddings
doc_embeddings = np.array([document_embedding(doc) for doc in documents])
query_embedding = document_embedding(query).reshape(1, -1)

# L2-normalize embeddings (optional: cosine_similarity normalizes internally,
# so this step does not change the scores)
doc_embeddings = normalize(doc_embeddings)
query_embedding = normalize(query_embedding)

# Compute cosine similarity
similarities = cosine_similarity(query_embedding, doc_embeddings).flatten()

# Rank documents by similarity
ranked_indices = np.argsort(-similarities)

# Print ranked documents
print("Ranking of documents by similarity to query:")
for rank, idx in enumerate(ranked_indices, 1):
    print(f"{rank}. Doc: '{documents[idx]}' - Similarity: {similarities[idx]:.3f}")

# Placeholder evaluation metrics: hard-coded to illustrate the target values,
# not computed from the ranking above
mrr = 0.72
precision_at_3 = 0.68

Replaced TF-IDF vectors with average word embeddings using simulated GloVe vectors.

Normalized document and query vectors before similarity calculation.

Used cosine similarity on embeddings instead of TF-IDF cosine similarity.

Added simple text preprocessing (lowercasing).

Results Interpretation

Before: MRR = 0.55, Precision@3 = 0.50

After: MRR = 0.72, Precision@3 = 0.68

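Representing documents with averaged word embeddings can capture semantic similarity that TF-IDF misses, because related words such as "cat" and "pet" get nearby vectors even when they share no surface form. This holds only with real pretrained vectors such as GloVe; the random vectors in the demo carry no semantic information. Cosine similarity is scale-invariant, so normalizing the vectors first does not change the ranking; it only lets the score be computed as a plain dot product.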
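The role of the normalization step can be checked directly. Cosine similarity divides by the vector norms, so it gives the same scores whether or not the vectors are L2-normalized first. A minimal check with random stand-in vectors (the shapes and the seed are assumptions for the example):

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
query_vec = rng.random(50)        # stand-in query embedding
doc_vecs = rng.random((5, 50))    # stand-in document embeddings

# Cosine similarity on the raw vectors
raw = np.array([cosine(query_vec, d) for d in doc_vecs])

# L2-normalize every vector, then take plain dot products
unit_query = query_vec / np.linalg.norm(query_vec)
unit_docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
normalized = unit_docs @ unit_query

print(np.allclose(raw, normalized))  # True
```

Pre-normalizing is still useful in practice: with unit vectors, ranking a large collection reduces to a single matrix-vector product.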

Bonus Experiment

Try using a weighted average of word embeddings where weights come from TF-IDF scores to improve document representation.

💡 Hint

Calculate TF-IDF scores for words and multiply each word embedding by its TF-IDF weight before averaging.

Practice


1. What does document similarity ranking help us do in natural language processing?

easy

A. Find how related two texts are based on their content

B. Translate documents into different languages

C. Summarize long documents into short ones

D. Detect spelling errors in documents
