
Document similarity ranking in NLP

Introduction

Document similarity ranking scores how closely related texts are and orders them by that score. This makes it easier to organize a collection and find relevant documents quickly.

Common uses:
Finding similar news articles to a given article
Recommending research papers related to a topic
Grouping customer reviews that talk about the same issue
Searching for documents that match a user's query
Detecting duplicate or near-duplicate documents in a database
Syntax
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# List of documents
documents = ["text1", "text2", "text3"]

# Convert documents to vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Compute similarity matrix
similarity_matrix = cosine_similarity(doc_vectors)

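TfidfVectorizer converts each document into a vector of TF-IDF weights. A word gets a higher weight when it is frequent in that document and rare across the rest of the collection.

cosine_similarity measures the angle between two vectors. For TF-IDF vectors, which contain no negative values, the score ranges from 0 (no shared terms) to 1 (same direction, as with identical text).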
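To make the cosine formula concrete, the sketch below computes it by hand (dot product divided by the product of the vector lengths) and compares the result with the library call. The helper name manual_cosine is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["I love apples", "I like oranges", "Apples and oranges are fruits"]
doc_vectors = TfidfVectorizer().fit_transform(documents)


def manual_cosine(a, b):
    """Hypothetical helper: dot product divided by the product of the lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


dense = doc_vectors.toarray()  # fine for a tiny example; avoid on large corpora
manual = manual_cosine(dense[0], dense[2])
library = float(cosine_similarity(doc_vectors)[0, 2])

print(f"manual:  {manual:.4f}")
print(f"library: {library:.4f}")
```

Both values should agree up to floating-point error. Documents 0 and 2 share only the word "apples", so the score lies strictly between 0 and 1.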

Examples
This example shows similarity scores between three simple sentences.
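from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["I love apples", "I like oranges", "Apples and oranges are fruits"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
similarity_matrix = cosine_similarity(doc_vectors)
print(similarity_matrix)  # 3x3 matrix; entry [i, j] compares document i with document j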
This example scores a new query against the existing documents. It reuses vectorizer and doc_vectors from the previous example.
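query = "I enjoy apples"
# Use transform (not fit_transform) so the query shares the fitted vocabulary
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, doc_vectors)
print(scores)  # one row with a score for each document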
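The query example stops at raw scores. One way to turn them into a ranking is to sort the document indices by score, as in this minimal sketch. The use of np.argsort is an assumption about how you might do it, not the only option.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["I love apples", "I like oranges", "Apples and oranges are fruits"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

query = "I enjoy apples"
# [0] takes the single row of scores that belongs to the one query
scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]

# argsort returns indices in ascending score order; reverse for best-first
ranking = np.argsort(scores)[::-1]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx]:.2f}  {documents[idx]}")
```

"I like oranges" shares no vocabulary with the query, so it scores 0 and lands last. The word "enjoy" is ignored because it was not seen during fitting, and the default tokenizer drops single-character tokens such as "I".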
Sample Model

This program converts four documents into TF-IDF vectors and calculates how similar each document is to every other document. It prints each row of the similarity matrix, with scores between 0 and 1.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "Machine learning is fun",
    "Artificial intelligence and machine learning",
    "I love reading about AI",
    "The sky is blue and beautiful"
]

# Convert documents to TF-IDF vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Compute similarity matrix
similarity_matrix = cosine_similarity(doc_vectors)

# Print similarity matrix rounded to 2 decimals
for i, row in enumerate(similarity_matrix):
    # float() converts NumPy scalars so the list prints as plain numbers
    print(f"Document {i} similarities:", [round(float(score), 2) for score in row])
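Each row of the matrix includes the document's similarity to itself, which is always 1. To find the closest other document, one option is to mask the diagonal before taking the maximum. This is a sketch under that assumption; the names others and best_match are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning is fun",
    "Artificial intelligence and machine learning",
    "I love reading about AI",
    "The sky is blue and beautiful",
]

similarity_matrix = cosine_similarity(TfidfVectorizer().fit_transform(documents))

# Copy the matrix and overwrite the diagonal so a document cannot match itself
others = similarity_matrix.copy()
np.fill_diagonal(others, -1.0)

best_match = others.argmax(axis=1)  # index of the closest other document per row
for i, j in enumerate(best_match):
    print(f"Document {i} is closest to document {j} (score {others[i, j]:.2f})")
```

A score of 0 means no real match. Document 2 shares no words with the others (TF-IDF does not know that "AI" and "artificial intelligence" are related), so its reported neighbor is arbitrary.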
Important Notes

Higher similarity scores mean documents are more alike.

TF-IDF reduces the weight of words that appear in many documents, such as 'the' or 'is', so they contribute less to the similarity score.

Cosine similarity works well for text because it focuses on the angle between vectors, not their length.
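To illustrate the point about angle versus length, this sketch compares a short document with a copy of itself repeated three times. It passes norm=None to TfidfVectorizer so the raw vectors keep their different lengths; that setting is chosen only to make the effect visible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

short_doc = "machine learning is fun"
long_doc = " ".join([short_doc] * 3)  # same words, three times as long
other_doc = "the sky is blue"

# norm=None disables the default L2 normalization of each row
vectorizer = TfidfVectorizer(norm=None)
vectors = vectorizer.fit_transform([short_doc, long_doc, other_doc]).toarray()

lengths = np.linalg.norm(vectors, axis=1)
similarity = cosine_similarity(vectors)

print("vector lengths:", np.round(lengths, 2))
print("short vs long: ", round(float(similarity[0, 1]), 2))
print("short vs other:", round(float(similarity[0, 2]), 2))
```

The long document's vector is three times longer but points in the same direction, so its similarity to the short one is still 1. The unrelated document scores much lower.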

Summary

Document similarity ranking helps find related texts by comparing their content.

Use TF-IDF to turn text into numbers that show word importance.

Cosine similarity measures how close two documents are. With TF-IDF vectors the score falls between 0 and 1.