NLP · ML · ~20 mins

Information retrieval basics in NLP - ML Experiment: Train & Evaluate

Experiment - Information retrieval basics
Problem: You want to build a simple system that finds the most relevant documents in a small collection for a given search query.
Current Metrics: The current system returns relevant documents with about 60% precision and 50% recall on the test queries.
Issue: The system retrieves many irrelevant documents and misses some relevant ones, hence the low precision and recall.
Your Task
Improve the retrieval system to achieve at least 75% precision and 70% recall on the test queries.
You can only change the way documents and queries are represented and how similarity is calculated.
You cannot add more documents or use external datasets.
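The exercise does not show the baseline's code, so the following is an assumed sketch of the kind of low-precision starting point described: raw word counts with no stop-word handling, where frequent function words count as much as content words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed baseline: raw term counts, no stop-word removal, no weighting.
documents = [
    "The cat sat on the mat.",
    "Dogs are great pets.",
]
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Function words like "the" and "on" contribute to the score just as
# much as "cat" does, which is what drags precision down on larger,
# messier collections.
query_vec = vectorizer.transform(["cat on the mat"])
scores = cosine_similarity(query_vec, doc_vectors).flatten()
best = int(scores.argmax())
print(best)
```

On this tiny example the count baseline still ranks the right document first; the weaknesses described in the problem statement show up once many documents share the same common words.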
Solution
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "The cat sat on the mat.",
    "Dogs are great pets.",
    "Cats and dogs can live together.",
    "The quick brown fox jumps over the lazy dog.",
    "Pets bring joy to people."
]

# Sample queries and their relevant document indices
queries = ["cat on mat", "dogs as pets", "quick fox"]
relevant_docs = [[0], [1, 2, 4], [3]]

# Initialize TF-IDF vectorizer with stop words removal
vectorizer = TfidfVectorizer(stop_words='english')

# Fit on documents
doc_vectors = vectorizer.fit_transform(documents)

# Transform queries
query_vectors = vectorizer.transform(queries)

# Function to retrieve top documents for each query

def retrieve_top_docs(query_vec, doc_vecs, top_k=3):
    similarities = cosine_similarity(query_vec, doc_vecs).flatten()
    # Rank documents by similarity, then keep only those that actually
    # share terms with the query: zero-similarity documents are padding
    # that drags precision down.
    top_indices = similarities.argsort()[::-1][:top_k]
    top_indices = [i for i in top_indices if similarities[i] > 0]
    return top_indices, similarities[top_indices]

# Evaluate precision and recall

total_precision = 0
total_recall = 0
num_queries = len(queries)

for i, query_vec in enumerate(query_vectors):
    top_docs, _ = retrieve_top_docs(query_vec, doc_vectors, top_k=3)
    retrieved_set = set(top_docs)
    relevant_set = set(relevant_docs[i])
    true_positives = len(retrieved_set.intersection(relevant_set))
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    total_precision += precision
    total_recall += recall

avg_precision = total_precision / num_queries * 100
avg_recall = total_recall / num_queries * 100

print(f"Average Precision: {avg_precision:.2f}%")
print(f"Average Recall: {avg_recall:.2f}%")
Replaced the simple word-count representation with TF-IDF vectorization to better capture word importance.
Removed common English stop words to reduce noise.
Used cosine similarity to measure how closely each query matches each document.
Retrieved at most the top 3 documents per query, discarding zero-similarity documents so that irrelevant padding does not hurt precision.
Results Interpretation

Before: Precision ~60%, Recall ~50%
After: Precision 100%, Recall 100% on these toy test queries, comfortably above the 75% / 70% targets (figures this high are an artifact of the tiny, clean document collection).

Using TF-IDF weighting and cosine similarity improves retrieval quality by focusing on important words and better matching queries to documents.
Bonus Experiment
Try adding bigrams (pairs of words) to the TF-IDF vectorizer to see if retrieval improves further.
💡 Hint
Set the ngram_range parameter in TfidfVectorizer to (1, 2) to include unigrams and bigrams.