NLP Program to Find Similar Documents Using Cosine Similarity
TF-IDF vectorizer to convert documents into vectors and cosine similarity to find similarity scores; for example, use sklearn.feature_extraction.text.TfidfVectorizer and sklearn.metrics.pairwise.cosine_similarity to find similar documents.Examples
How to Think About It
Algorithm
Code
from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics.pairwise import cosine_similarity documents = [ "I love machine learning", "Machine learning is fun", "I enjoy sports" ] vectorizer = TfidfVectorizer() tfidf_matrix = vectorizer.fit_transform(documents) similarity_matrix = cosine_similarity(tfidf_matrix) print(similarity_matrix)
Dry Run
Let's trace the example documents through the code
Input documents
["I love machine learning", "Machine learning is fun", "I enjoy sports"]
TF-IDF vectorization
Convert each document into a vector representing word importance, e.g., doc1 vector might be [0.5, 0.7, 0.0, ...]
Calculate cosine similarity
Compute similarity scores between each pair of vectors, resulting in a 3x3 matrix.
Output similarity matrix
[[1.0, 0.79, 0.0], [0.79, 1.0, 0.0], [0.0, 0.0, 1.0]]
| Document Pair | Similarity Score |
|---|---|
| Doc1 & Doc1 | 1.0 |
| Doc1 & Doc2 | 0.79 |
| Doc1 & Doc3 | 0.0 |
| Doc2 & Doc2 | 1.0 |
| Doc2 & Doc3 | 0.0 |
| Doc3 & Doc3 | 1.0 |
Why This Works
Step 1: TF-IDF Vectorization
TF-IDF converts text into numbers that show how important each word is in a document compared to all documents, helping capture meaning.
Step 2: Cosine Similarity Calculation
Cosine similarity measures the angle between two vectors; smaller angles mean more similar documents.
Step 3: Similarity Matrix Output
The matrix shows similarity scores between all document pairs, with 1 meaning identical and 0 meaning no similarity.
Alternative Approaches
import numpy as np from sklearn.metrics.pairwise import cosine_similarity # Example word vectors word_vectors = { 'machine': np.array([1, 0]), 'learning': np.array([0.9, 0.1]), 'fun': np.array([0, 1]), 'love': np.array([0.8, 0.2]), 'enjoy': np.array([0.7, 0.3]), 'sports': np.array([0, 0.9]) } def document_vector(doc): words = doc.lower().split() vectors = [word_vectors[w] for w in words if w in word_vectors] if vectors: return np.mean(vectors, axis=0) else: return np.zeros(2) docs = ["I love machine learning", "Machine learning is fun", "I enjoy sports"] vectors = np.array([document_vector(d) for d in docs]) sim_matrix = cosine_similarity(vectors) print(sim_matrix)
from sklearn.feature_extraction.text import CountVectorizer import numpy as np docs = ["I love machine learning", "Machine learning is fun", "I enjoy sports"] vectorizer = CountVectorizer(binary=True) binary_matrix = vectorizer.fit_transform(docs).toarray() # Jaccard similarity function def jaccard_similarity(x, y): intersection = np.sum(np.minimum(x, y)) union = np.sum(np.maximum(x, y)) return intersection / union if union != 0 else 0 size = len(docs) sim_matrix = np.zeros((size, size)) for i in range(size): for j in range(size): sim_matrix[i, j] = jaccard_similarity(binary_matrix[i], binary_matrix[j]) print(sim_matrix)
Complexity: O(n^2 * m) time, O(n * m) space
Time Complexity
Calculating TF-IDF vectors takes O(n * m) where n is documents and m is vocabulary size; computing pairwise cosine similarity is O(n^2 * m) because each pair of documents is compared.
Space Complexity
Storing TF-IDF vectors requires O(n * m) space; similarity matrix requires O(n^2) space.
Which Approach is Fastest?
TF-IDF with cosine similarity is efficient for moderate datasets; word embeddings add overhead loading vectors; Jaccard similarity is simpler but less precise.
| Approach | Time | Space | Best For |
|---|---|---|---|
| TF-IDF + Cosine Similarity | O(n^2 * m) | O(n * m) | Balanced accuracy and speed for text similarity |
| Word Embeddings Average + Cosine | O(n * l + n^2 * d) | O(n * d) | Semantic similarity with pre-trained vectors |
| Count Vectorizer + Jaccard | O(n^2 * m) | O(n * m) | Simple presence-based similarity, less precise |
