
TF-IDF (TfidfVectorizer) in NLP

Introduction

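TF-IDF (term frequency-inverse document frequency) scores how important a word is to a document. Words that appear often in one document but in few others get more weight, and words that appear in most documents get less.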

When you want to find key words in customer reviews.
When you need to convert text into numbers for machine learning.
When you want to compare documents by their important words.
When you want common words like 'the' or 'and' to count for less in text analysis.
When building search engines to rank documents by relevance.
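As one illustration of comparing documents by their important words, the sketch below turns a small corpus into TF-IDF vectors and measures how similar the documents are with cosine similarity. The three sentences are made up for this example, and cosine similarity is one common choice of measure, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus, made up for this illustration
docs = [
    "the cat sat on the mat",
    "a cat sat on a sofa",
    "stock prices fell sharply today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between all document vectors (3 x 3 matrix)
similarity = cosine_similarity(X)

# Documents 0 and 1 share words; document 2 shares none with either
print(similarity.round(2))
```

A search engine could use the same idea by vectorizing the query with the fitted vectorizer and ranking documents by their similarity to it.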
Syntax
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=None,  # max number of words to keep
    stop_words=None,    # words to ignore
    ngram_range=(1,1)   # single words by default
)
X = tfidf.fit_transform(documents)  # documents: a list of strings

fit_transform learns the vocabulary and the IDF weights from the documents, then returns a sparse matrix with one row per document and one column per word.

You can set stop_words='english' to ignore common English words.

Examples
This drops common English words like 'are', so only the remaining words get scores.
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(['I love cats', 'Cats are great pets'])
This scores single words and pairs of adjacent words (like 'love cats').
tfidf = TfidfVectorizer(ngram_range=(1,2))
X = tfidf.fit_transform(['I love cats', 'Cats are great pets'])
This keeps only the 3 terms with the highest total count across all documents.
tfidf = TfidfVectorizer(max_features=3)
X = tfidf.fit_transform(['I love cats', 'Cats are great pets'])
Sample Model

This code converts three sentences into numbers showing how important each word is, ignoring common words like 'the' and 'on'.

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    'The cat sat on the mat.',
    'The dog ate my homework.',
    'Cats and dogs are great pets.'
]

# Create TF-IDF vectorizer ignoring English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Learn vocabulary and transform documents
X = vectorizer.fit_transform(documents)

# Show feature names (words)
print('Words:', vectorizer.get_feature_names_out())

# Show TF-IDF matrix as array
print('TF-IDF matrix:\n', X.toarray())
Important Notes

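With the default settings, each document's vector is scaled to length 1 (L2 normalization), so TF-IDF values fall between 0 and 1. A higher value means the word is more important in that document.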

Common words get low scores because they appear in many documents.
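To make the scoring concrete, the sketch below recomputes the scores by hand with NumPy and compares them with TfidfVectorizer. It assumes scikit-learn's default settings (smooth_idf=True, norm='l2', raw counts as term frequency); other settings use a different formula.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I love cats", "Cats are great pets"]

# Raw term counts: one row per document, one column per word
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed IDF used when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf  # term frequency times IDF
# Scale each row to length 1 (L2 normalization)
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Here 'cats' appears in both documents, so it gets the smallest IDF, while words that appear in only one document get a larger one.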

You can use the TF-IDF matrix as input for machine learning models.
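A minimal sketch of that idea, assuming a made-up set of four labeled reviews and LogisticRegression as the classifier (any scikit-learn classifier that accepts sparse input could be swapped in):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, far too small for a real model
texts = [
    "great product, works well",
    "love it, excellent quality",
    "terrible, broke after a day",
    "awful quality, waste of money",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# The pipeline vectorizes the text, then passes the matrix to the classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

predictions = model.predict(["excellent product", "terrible waste"])
print(predictions)
```

Putting the vectorizer inside the pipeline means new text is transformed with the vocabulary learned during training, without extra code.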

Summary

TF-IDF finds important words by balancing word frequency and rarity.

TfidfVectorizer converts text into numbers for easy analysis.

It helps machines understand text by focusing on meaningful words.