Cosine similarity in NLP - Model Metrics & Evaluation

Cosine similarity measures how closely two vectors point in the same direction. It ranges from -1 to 1: a value near 1 means the vectors are very similar, 0 means they are orthogonal (no directional relationship), and -1 means they point in opposite directions. In NLP this metric is widely used to compare the meaning of texts, for example to check whether two sentences talk about the same thing.
Cosine similarity is not a classification metric, so it does not use a confusion matrix. Instead, it outputs a similarity score. For example, if we have two vectors A and B:
A = [1, 2, 3]
B = [2, 4, 6]
The cosine similarity is calculated as:
cos_sim = (A · B) / (||A|| * ||B||)
        = (1*2 + 2*4 + 3*6) / (sqrt(1^2+2^2+3^2) * sqrt(2^2+4^2+6^2))
        = 28 / (sqrt(14) * sqrt(56))
        = 28 / (3.742 * 7.483)
        = 1.0
This means A and B point in exactly the same direction, which makes sense because B = 2A.
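The calculation above can be sketched in a few lines of NumPy (the helper name `cosine_similarity` is ours for illustration, not a library function):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = [1, 2, 3]
B = [2, 4, 6]
print(round(cosine_similarity(A, B), 6))  # 1.0 -- B is exactly 2 * A
```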
Because cosine similarity is a score rather than a classification, precision and recall do not apply to it directly. However, once the score is used to decide whether two texts are similar enough (e.g., above a threshold), precision and recall become relevant:
- High threshold: Only very similar pairs are accepted. This leads to high precision (few false matches) but low recall (many true matches missed).
- Low threshold: More pairs are accepted as similar. This leads to high recall (few true matches missed) but low precision (more false matches).
Example: In a plagiarism detector, setting a high cosine similarity threshold means you catch only very close copies (high precision), but might miss clever paraphrases (low recall).
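To make the trade-off concrete, here is a minimal sketch with made-up similarity scores and ground-truth labels (1 = truly similar, 0 = not); all numbers are illustrative only:

```python
# Hypothetical (score, label) pairs for text comparisons.
pairs = [
    (0.95, 1), (0.88, 1), (0.72, 1), (0.68, 1),  # truly similar pairs
    (0.82, 0), (0.55, 0), (0.30, 0), (0.10, 0),  # dissimilar pairs
]

def precision_recall(pairs, threshold):
    """Treat score >= threshold as a predicted match."""
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(pairs, 0.9))  # (1.0, 0.25) -- precise but misses matches
print(precision_recall(pairs, 0.6))  # (0.8, 1.0)  -- catches all, some false hits
```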
Good cosine similarity values depend on the task:
- Good: Values close to 1 for truly similar texts, showing strong semantic match.
- Bad: Values near 0 or negative for texts that should be similar, indicating poor vector representation or noisy data.
For example, with good sentence embeddings, two sentences with the same meaning should typically score above 0.8. If the score is below roughly 0.5, the model or embeddings may not be capturing the meaning well.
- Ignoring vector magnitude: Cosine similarity discards length information, so vectors with very different magnitudes (e.g., a long document and a short one) can still score near 1 if they point in the same direction.
- Threshold choice: Picking a wrong similarity threshold can cause many false positives or negatives.
- Data quality: Poor or sparse embeddings lead to unreliable similarity scores.
- Confusing dot product with cosine similarity: A raw dot product on unnormalized vectors is not cosine similarity and can be misleading; either divide by the norms or pre-normalize the vectors to unit length first.
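The first and last pitfalls can be demonstrated directly: cosine similarity is unchanged by rescaling a vector, while a raw dot product is not (toy vectors, assuming NumPy):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 100 * a  # same direction, 100x the magnitude

print(float(np.dot(a, b)))                 # 1400.0 -- grows with magnitude
print(round(cosine_similarity(a, b), 6))   # 1.0    -- magnitude-invariant

# After normalizing to unit length, the dot product equals cosine similarity:
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(round(float(np.dot(a_unit, b_unit)), 6))  # 1.0
```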
Your model uses cosine similarity to find similar documents. You set a threshold of 0.9 but find many true similar pairs have scores around 0.7. Is your threshold good? Why or why not?
Answer: No, the threshold is too high: it misses many truly similar pairs (low recall). Lowering the threshold closer to 0.7, while checking that precision stays acceptable, would catch more true matches and improve recall.
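One way to sanity-check a threshold is to measure recall over pairs you already know are similar. A toy sketch with hypothetical scores like those in the exercise:

```python
# Hypothetical cosine scores for document pairs known to be truly similar.
true_pair_scores = [0.95, 0.72, 0.71, 0.69, 0.68]

def recall_at(threshold, scores):
    """Fraction of true similar pairs whose score clears the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

print(recall_at(0.90, true_pair_scores))  # 0.2 -- most true pairs missed
print(recall_at(0.65, true_pair_scores))  # 1.0 -- all true pairs recovered
```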