Document similarity ranking scores how closely related texts are and orders them by that score. It helps organize a collection and find relevant documents quickly.
Document similarity ranking in NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# List of documents
documents = ["text1", "text2", "text3"]
# Convert documents to vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
# Compute similarity matrix
similarity_matrix = cosine_similarity(doc_vectors)
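TfidfVectorizer converts each document into a numeric vector whose entries weight each word by how important it is to that document.
cosine_similarity measures how close two vectors are. For TF-IDF vectors, which have no negative entries, the score ranges from 0 (no words in common) to 1 (same direction, i.e. the same relative word weights).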
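To make the score concrete, here is a minimal sketch of the cosine formula (dot product divided by the product of the vector lengths) written in plain Python. The helper name `cosine` and the three sample vectors are illustrative assumptions, not part of scikit-learn.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: all-zero vectors score 0
    return dot / (norm_a * norm_b)

# Hypothetical term-weight vectors over a three-word vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, twice as long
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

print(cosine(doc_a, doc_b))  # approximately 1
print(cosine(doc_a, doc_c))  # 0.0
```

Vectors that point the same way score about 1 regardless of length, and vectors with no overlapping terms score 0.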
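from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = ["I love apples", "I like oranges", "Apples and oranges are fruits"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
# Pairwise similarity between every pair of documents
similarity_matrix = cosine_similarity(doc_vectors)
print(similarity_matrix)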
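# Reuses vectorizer and doc_vectors fitted in the previous example
query = "I enjoy apples"
# transform() keeps the vocabulary learned by fit_transform(); words not seen before, such as "enjoy", are ignored
query_vec = vectorizer.transform([query])
# One row of scores: the query compared with each document
scores = cosine_similarity(query_vec, doc_vectors)
print(scores)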
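The query example prints raw scores but does not order the documents. A minimal sketch of the ranking step, assuming the same three documents and query, could look like this; the variable name `ranked` is an illustrative choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["I love apples", "I like oranges", "Apples and oranges are fruits"]
query = "I enjoy apples"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([query])

# cosine_similarity returns a 1 x n_documents array; take its only row
scores = cosine_similarity(query_vec, doc_vectors)[0]

# Document indices ordered from most to least similar to the query
ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

for rank, i in enumerate(ranked, start=1):
    print(f"{rank}. {documents[i]!r} (score {scores[i]:.2f})")
```

With this data, "I love apples" ranks first because "apples" is the only query word in the fitted vocabulary, and "I like oranges" scores 0 because it shares no vocabulary word with the query.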
This program converts four documents into TF-IDF vectors and calculates how similar each document is to every other one. It prints one row per document, with similarity scores between 0 and 1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
documents = [
    "Machine learning is fun",
    "Artificial intelligence and machine learning",
    "I love reading about AI",
    "The sky is blue and beautiful",
]
# Convert documents to TF-IDF vectors
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
# Compute similarity matrix
similarity_matrix = cosine_similarity(doc_vectors)
# Print similarity matrix rounded to 2 decimals
# (float() converts NumPy scalars so the list prints as plain numbers)
for i, row in enumerate(similarity_matrix):
    print(f"Document {i} similarities:", [round(float(score), 2) for score in row])
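The matrix can also be used to pick, for each document, the most similar other document. The following is a sketch under the assumption that the same four documents are used; `best_match` is an illustrative name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning is fun",
    "Artificial intelligence and machine learning",
    "I love reading about AI",
    "The sky is blue and beautiful",
]

doc_vectors = TfidfVectorizer().fit_transform(documents)
similarity_matrix = cosine_similarity(doc_vectors)

# Every document matches itself perfectly, so exclude the diagonal
np.fill_diagonal(similarity_matrix, -1.0)

# Index of the highest-scoring other document in each row
best_match = similarity_matrix.argmax(axis=1)

for i, j in enumerate(best_match):
    score = similarity_matrix[i, j]
    print(f"Document {i} is closest to document {j} (score {score:.2f})")
```

Documents 0 and 1 pick each other because they share "machine" and "learning". Document 2 shares no vocabulary word with the others, so its best score is 0 and the reported match is arbitrary. A practical ranking would likely skip zero-score matches.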
Higher similarity scores mean documents are more alike.
TF-IDF reduces the effect of common words like 'the' or 'is' by giving lower weights to words that appear in many documents.
Cosine similarity works well for text because it focuses on the angle between vectors, not their length.
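The point about angle versus length can be checked with a small sketch: a text and the same text repeated three times produce vectors of different lengths but identical direction. This example passes `norm=None` to TfidfVectorizer so the vectors are not rescaled to unit length (the default is L2 normalization); the sample text is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

short_doc = "apples and oranges"
repeated_doc = " ".join([short_doc] * 3)  # same words, three times as long

# norm=None keeps raw TF-IDF weights so the length difference stays visible
vectorizer = TfidfVectorizer(norm=None)
vectors = vectorizer.fit_transform([short_doc, repeated_doc]).toarray()

lengths = np.linalg.norm(vectors, axis=1)
similarity = cosine_similarity(vectors)[0, 1]

print("vector lengths:", lengths)
print("cosine similarity:", round(float(similarity), 6))
```

The second vector is three times as long as the first, yet the cosine similarity is 1 (up to floating-point rounding), so a longer document is not penalized for repeating the same content.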
Document similarity ranking helps find related texts by comparing their content.
Use TF-IDF to turn text into numbers that show word importance.
Cosine similarity measures how close two documents are. With TF-IDF vectors, the score ranges from 0 (no words in common) to 1 (same direction).