MLOps · How-To · Beginner · 3 min read

How to Use TF-IDF in Python with sklearn

Use TfidfVectorizer from sklearn.feature_extraction.text to convert text data into TF-IDF features. Fit the vectorizer on your text data with fit_transform() to get the TF-IDF matrix representing word importance.
📐

Syntax

The main class to use is TfidfVectorizer. You create an instance, then call fit_transform() on your list of text documents to get the TF-IDF matrix. You can also use transform() on new data after fitting.

  • TfidfVectorizer(): Initializes the vectorizer with optional parameters like stop_words or max_features.
  • fit_transform(texts): Learns vocabulary and idf, then transforms texts to TF-IDF features.
  • transform(texts): Transforms new texts using the learned vocabulary and idf.
python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(texts)  # texts is a list of strings

# To transform new texts after fitting:
new_tfidf = vectorizer.transform(new_texts)
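To illustrate the optional parameters mentioned above (a minimal sketch with a made-up three-document corpus), `max_features` limits the vocabulary to the most frequent terms across the corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat', 'the cat ran', 'a dog barked']  # made-up corpus

# Default: every token of 2+ characters becomes a feature
full = TfidfVectorizer()
full.fit(docs)
print(sorted(full.vocabulary_))  # ['barked', 'cat', 'dog', 'ran', 'sat', 'the']

# max_features=3 keeps only the 3 most frequent terms across the corpus
small = TfidfVectorizer(max_features=3)
small.fit(docs)
print(sorted(small.vocabulary_))
```

Note that the single-character token 'a' is dropped by the default token pattern, which only matches words of two or more characters.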
💻

Example

This example shows how to create a TF-IDF matrix from a small list of text documents. It prints the feature names and the TF-IDF values for each document.

python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs are friends'
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('TF-IDF matrix shape:', tfidf_matrix.shape)

# Convert sparse matrix to dense and print
print('TF-IDF matrix (dense):')
print(tfidf_matrix.todense())
Output
Feature names: ['cat' 'cats' 'dog' 'dogs' 'friends' 'log' 'mat' 'sat']
TF-IDF matrix shape: (3, 8)
TF-IDF matrix (dense):
[[0.62276601 0.         0.         0.         0.         0.         0.62276601 0.4736296 ]
 [0.         0.         0.62276601 0.         0.         0.62276601 0.         0.4736296 ]
 [0.         0.57735027 0.         0.57735027 0.57735027 0.         0.         0.        ]]
⚠️

Common Pitfalls

  • Not removing stop words lets common words like "the" or "and" occupy feature slots, even though IDF downweights them; stop_words='english' filters them out up front.
  • Using fit_transform() on training data but forgetting to use transform() on test data leads to inconsistent features.
  • Printing the sparse matrix directly shows "(row, column)  value" triplets rather than a readable grid; convert it to dense first.
  • Setting max_features too low may exclude important words.
python
from sklearn.feature_extraction.text import TfidfVectorizer

texts_train = ['apple orange banana', 'banana fruit apple']
texts_test = ['apple fruit']

# Wrong: fitting separately on train and test
vectorizer_wrong = TfidfVectorizer()
train_matrix_wrong = vectorizer_wrong.fit_transform(texts_train)
test_matrix_wrong = vectorizer_wrong.fit_transform(texts_test)  # Wrong: should use transform()

# Right: fit on train, transform on test
vectorizer_right = TfidfVectorizer()
train_matrix_right = vectorizer_right.fit_transform(texts_train)
test_matrix_right = vectorizer_right.transform(texts_test)
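The sparse-printing pitfall can also be seen directly (a minimal sketch with a made-up two-document corpus): printing the sparse matrix shows coordinate/value triplets, while toarray() gives the familiar 2-D layout:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['red apple', 'green apple']  # made-up corpus
matrix = TfidfVectorizer().fit_transform(texts)

# A sparse matrix prints as "(row, col)  value" triplets, which is hard to read
print(matrix)

# toarray() (or todense()) produces a regular 2-D grid for inspection
print(matrix.toarray())
```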
📊

Quick Reference

Remember these key points when using TF-IDF in Python:

  • Use TfidfVectorizer to convert text to TF-IDF features.
  • Call fit_transform() on training data and transform() on new data.
  • Use stop_words='english' to remove common words.
  • Access feature names with get_feature_names_out().
  • TF-IDF output is a sparse matrix; convert to dense with todense() if needed.

Key Takeaways

  • Use sklearn's TfidfVectorizer to convert text data into TF-IDF features easily.
  • Always fit the vectorizer on training data and transform new data with the same vectorizer.
  • Remove stop words to focus on meaningful words using the stop_words parameter.
  • TF-IDF output is a sparse matrix; convert it to dense for easier inspection if needed.
  • Avoid fitting the vectorizer separately on train and test data to keep consistent features.