
How to Use TfidfVectorizer in sklearn with Python

Use TfidfVectorizer from sklearn.feature_extraction.text to convert text documents into TF-IDF feature vectors. Create the vectorizer, then call fit_transform() on your text data to get the TF-IDF matrix, a SciPy sparse matrix with one row per document and one column per vocabulary term.
📐

Syntax

The basic syntax to use TfidfVectorizer is:

  • TfidfVectorizer(): Creates the vectorizer object; all parameters are optional.
  • fit_transform(texts): Learns the vocabulary and IDF weights from texts, then returns the TF-IDF features for those same texts.
  • transform(new_texts): Converts new texts into TF-IDF features using the vocabulary and IDF weights learned during fitting.
python
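from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["first training document", "second training document"]  # placeholder data
new_texts = ["an unseen document"]  # placeholder data

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)  # learn vocabulary and IDF, then transform
new_tfidf = vectorizer.transform(new_texts)  # reuse the learned vocabulary and IDF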
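The difference between fit_transform() and transform() is easiest to see with a term that was never seen during fitting. The following is a minimal sketch with made-up sentences (train_texts and new_texts are illustrative, not part of any dataset): the new matrix has the same columns as the training matrix, and words outside the learned vocabulary are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data; "zebra" never appears in the training texts.
train_texts = ["the cat sat on the mat", "the dog ate my homework"]
new_texts = ["the cat ate a zebra"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
new_matrix = vectorizer.transform(new_texts)          # reuses that vocabulary

# Both matrices have one column per term learned during fitting.
print("train shape:", train_matrix.shape)
print("new shape:", new_matrix.shape)

# Terms that were not seen during fitting are not in the vocabulary.
print("zebra in vocabulary:", "zebra" in vectorizer.vocabulary_)

# Only the known words ("the", "cat", "ate") contribute nonzero entries.
print("nonzero entries in new row:", new_matrix.nnz)
```

Note that the default tokenizer also ignores single-character tokens, which is why "a" does not show up anywhere.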
💻

Example

This example shows how to convert a list of text documents into a TF-IDF matrix and print the feature names and matrix values.

python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "cats and dogs are friends"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

print("Feature names:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", tfidf_matrix.toarray())
Output (matrix values rounded to three decimals)
Feature names: ['and' 'are' 'ate' 'cat' 'cats' 'dog' 'dogs' 'friends' 'homework' 'mat' 'my' 'on' 'sat' 'the']
TF-IDF matrix:
 [[0.    0.    0.    0.398 0.    0.    0.    0.    0.    0.398 0.    0.398 0.398 0.605]
 [0.    0.    0.467 0.    0.    0.467 0.    0.    0.467 0.    0.467 0.    0.    0.355]
 [0.447 0.447 0.    0.    0.447 0.    0.447 0.447 0.    0.    0.    0.    0.    0.   ]]
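To see where these numbers come from, the matrix can be recomputed by hand. The sketch below assumes the default settings (smooth_idf=True, norm="l2", sublinear_tf=False), under which scikit-learn computes idf = ln((1 + n) / (1 + df)) + 1 and then scales each row to unit length. The variable names (counts, df, idf, manual) are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "cats and dogs are friends",
]

# Raw term counts: one row per document, one column per term.
counts = CountVectorizer().fit_transform(texts).toarray()
n_docs = counts.shape[0]

# Document frequency: how many documents contain each term.
df = (counts > 0).sum(axis=0)

# Smoothed IDF, matching the default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against what TfidfVectorizer produces with default settings.
sklearn_result = TfidfVectorizer().fit_transform(texts).toarray()
print("matches sklearn:", np.allclose(manual, sklearn_result))
```

This also explains why "the" scores lower in the second document than in the first: it occurs twice in the first document but only once in the second, and it does not occur in the third at all.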
⚠️

Common Pitfalls

Common mistakes when using TfidfVectorizer:

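  • Calling transform() before the vectorizer has been fitted raises a NotFittedError; call fit() or fit_transform() on the training texts first.
  • Passing a single raw string to transform() instead of a list of strings raises a ValueError; wrap a single document in a list.
  • Leaving stop words in can add noise; use stop_words='english' to remove common English words.
  • Printing the sparse matrix directly shows only the coordinates of nonzero entries; call toarray() to view a small matrix, but keep large matrices sparse, since a dense copy can use far more memory.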
python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["hello world", "machine learning"]
vectorizer = TfidfVectorizer()

# Wrong: transform before fit
# tfidf_matrix = vectorizer.transform(texts)  # raises NotFittedError

# Right way:
tfidf_matrix = vectorizer.fit_transform(texts)

print(tfidf_matrix.toarray())
Output
[[0.70710678 0.         0.         0.70710678]
 [0.         0.70710678 0.70710678 0.        ]]
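The first and last pitfalls can be checked directly. This is a small sketch rather than a recommended pattern: it catches the NotFittedError that an unfitted vectorizer raises, then inspects the fitted result without converting it to a dense array. The raised flag is an illustrative variable.

```python
from scipy.sparse import issparse
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["hello world", "machine learning"]
vectorizer = TfidfVectorizer()

# Calling transform() on an unfitted vectorizer raises NotFittedError.
try:
    vectorizer.transform(texts)
    raised = False
except NotFittedError:
    raised = True
print("transform before fit raised NotFittedError:", raised)

# After fitting, the result is a SciPy sparse matrix.
tfidf_matrix = vectorizer.fit_transform(texts)
print("is sparse:", issparse(tfidf_matrix))

# shape and nnz describe the matrix without building a dense copy.
print("shape:", tfidf_matrix.shape)          # (documents, vocabulary terms)
print("stored nonzeros:", tfidf_matrix.nnz)
```

For a matrix this small, toarray() is harmless; the sparse form matters once the vocabulary grows to thousands of terms.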
📊

Quick Reference

Key parameters of TfidfVectorizer:

  • stop_words='english': Removes words on scikit-learn's built-in English stop word list.
  • max_features=1000: Keeps only the 1000 terms with the highest frequency across the corpus.
  • ngram_range=(1, 2): Extracts both unigrams and bigrams.
  • max_df=0.8: Ignores terms that appear in more than 80% of documents.
  • min_df=2: Ignores terms that appear in fewer than 2 documents.

Key Takeaways

Initialize TfidfVectorizer and call fit_transform on your text data to get TF-IDF features.
Use transform on new data only after fitting the vectorizer to training data.
Convert the sparse TF-IDF matrix to an array with toarray() to inspect small results, but keep large matrices sparse to save memory.
Use parameters like stop_words and ngram_range to customize text processing.
Common errors include calling transform before fit and ignoring stop words.