
What is TF-IDF in NLP: Explanation and Example

TF-IDF stands for Term Frequency-Inverse Document Frequency, a technique in NLP to measure how important a word is in a document compared to a collection of documents. It helps highlight words that are unique or meaningful in a text by balancing how often they appear with how common they are across all texts.
⚙️

How It Works

Imagine several baskets of fruit, where each basket is a document and each kind of fruit is a word, and you want to find which fruits are special in each basket. Term Frequency (TF) counts how often a word appears in one document, like counting the apples in one basket. Inverse Document Frequency (IDF) checks how rare that word is across all baskets, giving more value to fruits that appear in fewer baskets.

By multiplying TF and IDF, TF-IDF scores a word higher when it appears often in one document but in few of the others. This helps computers focus on words that carry more meaning for that specific document, while down-weighting common words like "the" or "and" that appear almost everywhere.
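
The multiplication described above can be sketched from scratch. This is a minimal illustration that assumes the textbook definitions: TF as a word's count divided by the document's length, and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tf_idf`, the sample sentences, and the whitespace tokenization are choices made for this sketch. Libraries such as sklearn use a smoothed, normalized variant, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # TF = share of the document's words; IDF = log(N / df)
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
]
for i, doc_scores in enumerate(tf_idf(docs), start=1):
    top = max(doc_scores, key=doc_scores.get)
    print(f"Document {i}: top word = {top!r}, score for 'the' = {doc_scores['the']:.3f}")
```

Because "the" occurs in every document of this sample, its IDF is log(1) = 0 and its score vanishes, while "cat", found only in the first document, scores highest there.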

💻

Example

This example shows how to calculate TF-IDF scores for a small set of documents using scikit-learn (imported as sklearn) in Python.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends"
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the texts
tfidf_matrix = vectorizer.fit_transform(texts)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert matrix to array for display
tfidf_array = tfidf_matrix.toarray()

# Print TF-IDF scores for each document
for i, doc in enumerate(tfidf_array):
    print(f"Document {i+1} TF-IDF scores:")
    for word, score in zip(feature_names, doc):
        if score > 0:
            print(f"  {word}: {score:.3f}")
    print()
```
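Output

```
Document 1 TF-IDF scores:
  cat: 0.428
  mat: 0.428
  on: 0.325
  sat: 0.325
  the: 0.650

Document 2 TF-IDF scores:
  dog: 0.428
  log: 0.428
  on: 0.325
  sat: 0.325
  the: 0.650

Document 3 TF-IDF scores:
  and: 0.447
  are: 0.447
  cats: 0.447
  dogs: 0.447
  friends: 0.447
```

In such a tiny corpus, "the" still scores highest in the first two documents: it occurs twice in each and is missing from the third, so its IDF is not much lower than that of the rarer words.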
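
The scores sklearn reports do not come from the plain TF × IDF product. With default settings, `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document's vector to unit length (L2 norm). The sketch below recomputes the first document's scores by hand under those defaults. The helper name `sklearn_style_scores` is hypothetical, and the whitespace tokenization only approximates sklearn's tokenizer, which also drops single-character tokens.

```python
import math
from collections import Counter


def sklearn_style_scores(docs, index):
    """Recompute TF-IDF for docs[index] with smoothed IDF and L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency of every word
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Raw count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    counts = Counter(tokenized[index])
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in counts.items()
    }

    # Scale the document vector to unit Euclidean length
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]
for word, score in sorted(sklearn_style_scores(texts, 0).items()):
    print(f"{word}: {score:.3f}")
```

Smoothing keeps every IDF at 1 or above, so a word found in every document is down-weighted rather than zeroed out, and a word repeated within a short document can still receive a high score.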
🎯

When to Use

Use TF-IDF when you want to find important words in documents, especially for tasks like search engines, document classification, or recommendation systems. It helps highlight keywords that describe the content well without being distracted by common words.

For example, a search engine can use TF-IDF to rank pages by how relevant their words are to your query. In spam detection, it helps spot unusual words that might indicate spam. It works best when you have many documents and want to compare their content.
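
The search use case can be sketched in a few lines. This is a simplified illustration, not how any particular search engine works: it assumes the "pages" are three short strings, turns both pages and query into TF-IDF vectors with sklearn's `TfidfVectorizer`, and ranks pages by cosine similarity to the query. The sample pages and the query text are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini "index" of pages
pages = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]
query = "cat on a mat"

vectorizer = TfidfVectorizer()
page_vectors = vectorizer.fit_transform(pages)  # learn vocabulary and IDF from the pages
query_vector = vectorizer.transform([query])    # reuse the same vocabulary for the query

# Cosine similarity between the query and every page, one value per page
similarities = cosine_similarity(query_vector, page_vectors).ravel()

# Page indices ordered from most to least similar
ranking = similarities.argsort()[::-1]
for idx in ranking:
    print(f"{similarities[idx]:.3f}  {pages[idx]}")
```

The query is transformed with the vectorizer that was fitted on the pages, so words the pages never contained are simply ignored, and a page that shares no words with the query gets a similarity of 0.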

Key Points

  • TF measures how often a word appears in one document.
  • IDF measures how rare a word is across all documents.
  • TF-IDF combines both to find important words.
  • It helps ignore common words and focus on meaningful ones.
  • Widely used in search, text classification, and information retrieval.

Key Takeaways

  • TF-IDF scores words by importance based on frequency and rarity.
  • It highlights unique words in a document compared to others.
  • Useful for search engines, text classification, and keyword extraction.
  • Combines term frequency with inverse document frequency for balance.
  • Easy to compute with libraries like sklearn in Python.