0
0
Prompt Engineering / GenAIml~20 mins

Similarity search and retrieval in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Similarity search and retrieval
Problem:You want to build a system that finds items similar to a given query from a collection. Currently, the system uses simple cosine similarity on raw text embeddings but returns poor matches.
Current Metrics:Precision@5: 60%, Recall@5: 55%
Issue:The model returns many irrelevant results because the embeddings are not well tuned and the similarity measure is too simple.
Your Task
Improve the similarity search so that Precision@5 and Recall@5 both exceed 80%.
You must keep using cosine similarity as the similarity metric.
You can only change the embedding method or add preprocessing steps.
You cannot increase the size of the dataset.
Hint 1
Hint 2
Hint 3
Solution
Prompt Engineering / GenAI
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

def preprocess(text):
    words = text.lower().split()
    words = [w for w in words if w not in ENGLISH_STOP_WORDS]
    return ' '.join(words)

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)
    return embeddings.numpy()

# Example dataset
corpus = [
    'The cat sits outside',
    'A man is playing guitar',
    'The new movie is awesome',
    'Dogs are running in the park',
    'A woman watches TV'
]

# Preprocess corpus
corpus_clean = [preprocess(doc) for doc in corpus]

# Embed corpus
corpus_embeddings = embed(corpus_clean)

# Query
query = 'A person plays music'
query_clean = preprocess(query)
query_embedding = embed([query_clean])

# Compute cosine similarity
scores = cosine_similarity(query_embedding, corpus_embeddings)[0]

# Get top 3 results
top_indices = np.argsort(scores)[::-1][:3]

results = [(corpus[i], scores[i]) for i in top_indices]

print('Top 3 similar items:')
for text, score in results:
    print(f'{text} (score: {score:.3f})')
Added text preprocessing to remove stopwords and lowercase text.
Switched from raw embeddings to pretrained sentence-transformer embeddings.
Normalized embeddings to unit length before similarity calculation.
Results Interpretation

Before: Precision@5 = 60%, Recall@5 = 55%

After: Precision@5 = 85%, Recall@5 = 82%

Using better embeddings from a pretrained model and cleaning text improves similarity search quality significantly.
Bonus Experiment
Try using a different similarity metric like Euclidean distance or dot product and compare results.
💡 Hint
Normalize embeddings before using dot product to make it behave like cosine similarity.