What is TF-IDF in NLP: Explanation and Example
TF-IDF stands for Term Frequency-Inverse Document Frequency, a technique in NLP to measure how important a word is in a document compared to a collection of documents. It helps highlight words that are unique or meaningful in a text by balancing how often they appear with how common they are across all texts.
How It Works
Imagine each document is a basket of fruit and each fruit is a word. You want to find which fruits are special to each basket. Term Frequency (TF) counts how often a word appears in one document, like counting the apples in one basket. Inverse Document Frequency (IDF) measures how rare that word is across all baskets, giving more weight to fruits that appear in fewer baskets.
By multiplying TF and IDF, TF-IDF gives a word a high score if it appears often in one document but rarely in the others. This helps computers focus on words that carry more meaning for that specific document and gives little weight to common words like "the" or "and" that appear everywhere.
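The multiplication described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, TF as the word count divided by the document length, and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tf_idf` and the sample sentences are made up for this sketch, and other TF-IDF variants use slightly different formulas.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical sample documents; "the" occurs in every one of them.
docs = ["the cat sat on the mat", "the dog sat on the log", "the bird flew away"]
scores = tf_idf(docs)
for i, doc_scores in enumerate(scores, start=1):
    print(f"Document {i}:", {w: round(s, 3) for w, s in doc_scores.items()})
```

Because "the" occurs in all three documents, its IDF is log(3 / 3) = 0 and its score is zero everywhere, while "cat", which occurs in only one document, scores higher than "sat", which occurs in two.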
Example
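This example shows how to calculate TF-IDF scores for a small set of documents using Python's scikit-learn library (imported as sklearn). By default, scikit-learn applies its own smoothing and normalization, so the numbers differ from a plain TF × IDF product.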
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends"
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the texts
tfidf_matrix = vectorizer.fit_transform(texts)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert matrix to array for display
tfidf_array = tfidf_matrix.toarray()

# Print TF-IDF scores for each document
for i, doc in enumerate(tfidf_array):
    print(f"Document {i+1} TF-IDF scores:")
    for word, score in zip(feature_names, doc):
        if score > 0:
            print(f"  {word}: {score:.3f}")
    print()
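To see where the library's numbers come from, the weighting can be approximated by hand. The sketch below assumes the documented defaults of `TfidfVectorizer` (raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document's scores) and replaces the library's tokenizer with a simple whitespace split, which should give the same tokens for these particular sentences. The function name `smoothed_tfidf` is made up for this illustration.

```python
import math
from collections import Counter

texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

def smoothed_tfidf(docs):
    """Raw count * (ln((1 + n) / (1 + df)) + 1), then L2-normalize each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Number of documents each word appears in.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    rows = []
    for tokens in tokenized:
        weights = {
            w: c * (math.log((1 + n) / (1 + df[w])) + 1)
            for w, c in Counter(tokens).items()
        }
        # Scale the document's scores so their squares sum to 1.
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({w: v / norm for w, v in weights.items()})
    return rows

rows = smoothed_tfidf(texts)
for word, score in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

Under this variant the IDF never drops to zero, so "the" still receives the largest score in the first document because it appears twice there. If that is unwanted, passing `stop_words="english"` to `TfidfVectorizer` removes such words before scoring.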
When to Use
Use TF-IDF when you want to find important words in documents, especially for tasks like search engines, document classification, or recommendation systems. It helps highlight keywords that describe the content well without being distracted by common words.
For example, a search engine can use TF-IDF as one signal to rank pages by how relevant their words are to your query. In spam detection, it can turn messages into numeric features that bring out words characteristic of spam. It works best when you have many documents and want to compare their content.
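The search use case can be illustrated with a toy ranking function. This is only a sketch under simple assumptions: whitespace tokenization, raw counts times log(N / df) as weights, and cosine similarity between the query and each document. The helper names (`build_index`, `cosine`, `search`) and the sample documents are hypothetical, and real search engines rely on more elaborate ranking functions.

```python
import math
from collections import Counter

def build_index(docs):
    """Compute an IDF table and one TF-IDF vector (as a dict) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log(n / df[w]) for w in df}
    vectors = [
        {w: c * idf[w] for w, c in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, docs):
    """Return (document index, similarity) pairs, best match first."""
    idf, vectors = build_index(docs)
    # Words the index has never seen get weight 0 in the query vector.
    q_vec = {w: c * idf.get(w, 0.0) for w, c in Counter(query.lower().split()).items()}
    ranked = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(i, score) for score, i in ranked]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the dog chased the cat",
]
results = search("dog log", docs)
for index, score in results:
    print(f"{score:.3f}  {docs[index]}")
```

For the query "dog log", the second document ranks first because it contains both query words, including the rarer "log". The third document matches only "dog" and ranks below it, and the first document shares no query word and scores zero.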
Key Points
- TF measures how often a word appears in one document.
- IDF measures how rare a word is across all documents.
- TF-IDF multiplies the two to find words that are important to a specific document.
- It gives little weight to common words and more to meaningful ones.
- Widely used in search, text classification, and information retrieval.
