How to Use TfidfVectorizer in sklearn with Python

Use TfidfVectorizer from sklearn.feature_extraction.text to convert text documents into TF-IDF feature vectors. Create an instance, then call fit_transform() on a list of documents to get a sparse TF-IDF matrix with one row per document and one column per vocabulary term.

📐

Syntax

The basic syntax to use TfidfVectorizer is:

TfidfVectorizer(): Creates the vectorizer object; all parameters are optional.
fit_transform(texts): Learns the vocabulary and IDF weights from the texts, then transforms those texts into TF-IDF features.
transform(new_texts): Transforms new texts using the vocabulary and IDF weights already learned, without refitting.

python

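from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["first training document", "second training document"]  # placeholder corpus
new_texts = ["an unseen document"]  # placeholder new data

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)  # learn vocabulary and IDF, then transform
new_tfidf = vectorizer.transform(new_texts)     # reuse the fitted vocabulary and IDF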
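
As a minimal sketch of the fit/transform split (the sample sentences and variable names are made up for illustration), the vectorizer below is fitted once on training texts and then reused on unseen text. Words that were not in the fitted vocabulary are dropped, so the new matrix has the same columns as the training one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data, not taken from any real dataset
train_texts = ["the cat sat on the mat", "the dog ate my homework"]
new_texts = ["the cat chased a zebra"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
new_matrix = vectorizer.transform(new_texts)          # reuses them unchanged

# Both matrices have one column per term in the fitted vocabulary
print(train_matrix.shape, new_matrix.shape)

# "zebra" was never seen during fit, so it has no column
print("zebra" in vectorizer.vocabulary_)
```

Fitting only on training data and calling transform() on everything else keeps the feature columns consistent between training and prediction.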

💻

Example

This example shows how to convert a list of text documents into a TF-IDF matrix and print the feature names and matrix values.

python

from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "cats and dogs are friends"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

print("Feature names:", vectorizer.get_feature_names_out())
# Round to 2 decimals so each row fits on one line
print("TF-IDF matrix:\n", tfidf_matrix.toarray().round(2))

Output

Feature names: ['and' 'are' 'ate' 'cat' 'cats' 'dog' 'dogs' 'friends' 'homework' 'mat'
 'my' 'on' 'sat' 'the']
TF-IDF matrix:
 [[0.   0.   0.   0.4  0.   0.   0.   0.   0.   0.4  0.   0.4  0.4  0.61]
 [0.   0.   0.47 0.   0.   0.47 0.   0.   0.47 0.   0.47 0.   0.   0.36]
 [0.45 0.45 0.   0.   0.45 0.   0.45 0.45 0.   0.   0.   0.   0.   0.  ]]
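
To see where the weights come from, the sketch below recomputes the first document's row by hand and compares it with the library's result. It assumes scikit-learn's default settings (smooth_idf=True, norm='l2', raw term counts), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper smooth_idf and the hand-written count dictionaries are illustrative, not part of the library.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "cats and dogs are friends",
]
n_docs = len(texts)

def smooth_idf(doc_freq):
    # Smoothed IDF under the assumed defaults: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Term counts in the first document, and how many documents contain each term
counts = {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
doc_freq = {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}

# tf * idf, then divide by the row's L2 norm
raw = {term: counts[term] * smooth_idf(doc_freq[term]) for term in counts}
l2_norm = math.sqrt(sum(w * w for w in raw.values()))
manual = {term: w / l2_norm for term, w in raw.items()}

# Compare against the library's values for the same row
vectorizer = TfidfVectorizer()
row = vectorizer.fit_transform(texts).toarray()[0]
for term, weight in manual.items():
    col = vectorizer.vocabulary_[term]
    print(f"{term}: manual={weight:.4f} sklearn={row[col]:.4f}")
```

Because "the" appears in two of the three documents, its IDF is lower than that of the other terms, but it still gets the largest weight in this row since it occurs twice.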

⚠️

Common Pitfalls

Common mistakes when using TfidfVectorizer:

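Calling transform() before fit() or fit_transform() raises a NotFittedError, because no vocabulary or IDF weights have been learned yet.
Passing a single string instead of a list of strings raises a ValueError; wrap a lone document in a list, for example ["hello world"].
Leaving stop words in can add noise; use stop_words='english' to remove common English words.
Printing the result directly shows a sparse-matrix summary rather than a table; call toarray() to inspect small matrices, but keep large ones sparse to save memory.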

python

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["hello world", "machine learning"]
vectorizer = TfidfVectorizer()

# Wrong: transform before fit
# tfidf_matrix = vectorizer.transform(texts)  # Raises NotFittedError: nothing has been learned yet

# Right way:
tfidf_matrix = vectorizer.fit_transform(texts)

print(tfidf_matrix.toarray())

Output

[[0.70710678 0.         0.         0.70710678]
 [0.         0.70710678 0.70710678 0.        ]]
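
Two failure modes are easy to reproduce safely with try/except: calling transform() on an unfitted vectorizer, and passing a bare string where a list of documents is expected. This is a sketch assuming a recent scikit-learn release; the variable names unfitted_error and string_error are only for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Pitfall: transform() before the vectorizer has been fitted
try:
    vectorizer.transform(["hello world"])
    unfitted_error = None
except NotFittedError as exc:
    unfitted_error = exc
print("transform before fit:", type(unfitted_error).__name__)

# Pitfall: a bare string instead of a list of documents
try:
    vectorizer.fit_transform("hello world")
    string_error = None
except ValueError as exc:
    string_error = exc
print("bare string:", type(string_error).__name__)

# Wrapping the single document in a list works
single_doc = vectorizer.fit_transform(["hello world"])
print(single_doc.shape)
```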

📊

Quick Reference

Key parameters of TfidfVectorizer:

stop_words='english': Removes common English words using scikit-learn's built-in list.
max_features=1000: Keeps only the 1000 most frequent terms in the corpus.
ngram_range=(1,2): Includes both unigrams and bigrams.
max_df=0.8: Ignores terms that appear in more than 80% of documents.
min_df=2: Ignores terms that appear in fewer than 2 documents.

✅

Key Takeaways

Initialize TfidfVectorizer and call fit_transform on your text data to get TF-IDF features.

Use transform on new data only after fitting the vectorizer to training data.

Convert the sparse TF-IDF matrix to an array with toarray() to inspect small examples; keep large matrices sparse for processing.

Use parameters like stop_words and ngram_range to customize text processing.

Common mistakes include calling transform before fit and leaving noisy stop words in the vocabulary.