How to Use TfidfVectorizer for Text in NLP

Use TfidfVectorizer from sklearn.feature_extraction.text to convert text data into numerical features by calculating term frequency-inverse document frequency (TF-IDF). Fit the vectorizer on your text data with fit_transform() to get a matrix of TF-IDF features ready for machine learning models.

📐

Syntax

The TfidfVectorizer is initialized with optional parameters to control text processing. Use fit() to learn vocabulary and IDF from training data, and transform() to convert new text. fit_transform() does both in one step.

TfidfVectorizer(): Creates the vectorizer object.
fit(texts): Learns vocabulary and IDF from texts.
transform(texts): Converts texts to TF-IDF features.
fit_transform(texts): Fits and transforms texts in one step.

python

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words='english',  # remove common English words
    max_features=1000,     # limit to top 1000 features
    ngram_range=(1, 2)     # use unigrams and bigrams
)

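# Placeholder corpora; substitute your own lists of strings
texts = ['first training document', 'second training document']
new_texts = ['an unseen document']

# Fit on the training texts and transform them in one step
X = vectorizer.fit_transform(texts)

# Transform new texts with the vocabulary and IDF learned above
X_new = vectorizer.transform(new_texts)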
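
To see what fit() stores and how transform() treats words it has never seen, the sketch below inspects a fitted vectorizer. The two-sentence corpus and the variable names are made up for illustration; vocabulary_ and idf_ are the fitted attributes that scikit-learn exposes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative training corpus (not from the article)
train_texts = ['the cat sat', 'the dog barked']

vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts)

# vocabulary_ maps each learned term to its column index
vocab = vectorizer.vocabulary_
# idf_ holds one inverse document frequency weight per column
idf = vectorizer.idf_
print(sorted(vocab), idf.shape)

# 'bird' and 'sang' were not seen during fit, so transform() ignores them;
# only 'the' contributes to the resulting row
X_new = vectorizer.transform(['the bird sang'])
print(X_new.shape, X_new.nnz)
```

The new row has the same number of columns as the training matrix, which is what lets a downstream model accept it.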

💻

Example

This example shows how to convert a list of text documents into a TF-IDF feature matrix using TfidfVectorizer. It prints the feature names and the TF-IDF values for each document.

python

from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('TF-IDF matrix:\n', X.toarray())

Output

Feature names: ['coding' 'fun' 'learning' 'love' 'machine' 'python']
TF-IDF matrix:
 [[0.         0.         0.57735027 0.57735027 0.57735027 0.        ]
 [0.         0.68091856 0.51785612 0.         0.51785612 0.        ]
 [0.62276601 0.         0.         0.4736296  0.         0.62276601]]
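
To make the weighting less of a black box, here is a minimal sketch that recomputes the matrix by hand and compares it with TfidfVectorizer. It assumes scikit-learn's default settings (smooth_idf=True, norm='l2', raw counts as term frequency), under which the IDF is ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit length. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

# Raw term counts per document (term frequency)
counts = CountVectorizer(stop_words='english').fit_transform(texts).toarray()

n_docs = counts.shape[0]
# Document frequency: number of documents containing each term
df = (counts > 0).sum(axis=0)
# Smoothed IDF, as used by scikit-learn when smooth_idf=True
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then L2-normalize each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer(stop_words='english').fit_transform(texts).toarray()
print(np.allclose(manual, reference))
```

Terms that appear in fewer documents get a larger IDF, so within one document they end up with a higher weight than terms shared across documents.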

⚠️

Common Pitfalls

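Leaving stop words in fills the vocabulary with very common words. IDF down-weights them, but they still take up feature slots, which matters most when max_features is set.
Calling fit_transform() on test data instead of transform() relearns the vocabulary and IDF from the test set, so the test features no longer line up with the training features and test-set statistics leak into the representation.
Skipping text cleanup may reduce quality. TfidfVectorizer already lowercases text and drops punctuation and single-character tokens by default, but any further normalization is up to you.
Setting max_features too low can discard informative words, because only the most frequent terms are kept.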
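
To check what the built-in preprocessing does before deciding on extra cleanup, you can run the vectorizer's analyzer on a sample string. This is a small sketch; the sample sentence is made up, and build_analyzer() is the scikit-learn method that returns the preprocessing and tokenization callable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample = 'The Cat, the DOG, and a bird!'

# Defaults: lowercase, strip punctuation, drop single-character tokens ('a')
default_tokens = TfidfVectorizer().build_analyzer()(sample)
print(default_tokens)

# With the English stop word list, 'the' and 'and' are removed as well
filtered_tokens = TfidfVectorizer(stop_words='english').build_analyzer()(sample)
print(filtered_tokens)

# Turning lowercasing off keeps 'The' and 'the' as separate terms
cased_tokens = TfidfVectorizer(lowercase=False).build_analyzer()(sample)
print(cased_tokens)
```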

python

from sklearn.feature_extraction.text import TfidfVectorizer

texts_train = ['I love machine learning', 'Machine learning is fun']
texts_test = ['I love coding']

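# Wrong: refitting on test data rebuilds the vocabulary and IDF from the test set,
# so the test columns no longer match the training features
vectorizer_wrong = TfidfVectorizer()
X_train_wrong = vectorizer_wrong.fit_transform(texts_train)
X_test_wrong = vectorizer_wrong.fit_transform(texts_test)  # wrong: refits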

# Right: fit on train, transform on test
vectorizer_right = TfidfVectorizer()
X_train_right = vectorizer_right.fit_transform(texts_train)
X_test_right = vectorizer_right.transform(texts_test)  # correct
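
A quick way to spot this mistake is to compare matrix shapes. The sketch below reuses the same train and test sentences. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts_train = ['I love machine learning', 'Machine learning is fun']
texts_test = ['I love coding']

# Fit once on the training data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)

# transform() keeps the 5 training columns; 'coding' is unseen and ignored
X_test_ok = vectorizer.transform(texts_test)

# A refit on the test text learns a different, 2-term vocabulary
X_test_bad = TfidfVectorizer().fit_transform(texts_test)

print(X_train.shape, X_test_ok.shape, X_test_bad.shape)
# (2, 5) (1, 5) (1, 2)
```

A model trained on five columns cannot score a two-column matrix, and even when the column counts happen to match, the columns would refer to different words.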

📊

Quick Reference

Here is a quick summary of key TfidfVectorizer parameters and methods:

Parameter / Method	Description
stop_words	Remove common words like 'the', 'and' (e.g., 'english')
max_features	Keep only the top N terms, ranked by term frequency across the corpus
ngram_range	Tuple (min_n, max_n) to include n-grams, e.g., (1,2) for unigrams and bigrams
fit(texts)	Learn vocabulary and IDF from texts
transform(texts)	Convert texts to TF-IDF features using learned vocabulary
fit_transform(texts)	Fit and transform texts in one step
get_feature_names_out()	Return an array of feature names (words or n-grams) in column order

✅

Key Takeaways

Use TfidfVectorizer to convert text into numerical features that weight each term by how often it appears in a document and how rare it is across the corpus.

Always fit the vectorizer on training data only and call transform() on test data, so the test features match the training vocabulary and no test-set statistics leak into the IDF weights.

Remove stop words and consider n-grams to improve feature quality.

Check feature names with get_feature_names_out() to understand what words are used.

Adjust parameters like max_features and ngram_range to suit your dataset size and task.
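
One way to keep the fit/transform split automatic is to wrap the vectorizer and a model in a scikit-learn pipeline. This is a minimal sketch, not a recommended setup: the toy texts, the labels, and the choice of LogisticRegression are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data, invented for this example
train_texts = [
    'python code and machine learning',
    'training a model with python',
    'baking bread with fresh yeast',
    'a recipe for chocolate cake'
]
train_labels = ['tech', 'tech', 'food', 'food']
test_texts = ['learning python', 'cake recipe']

# fit() calls fit_transform() on the vectorizer and then fits the classifier;
# predict() only calls transform(), so the test texts never affect the vocabulary
model = make_pipeline(TfidfVectorizer(stop_words='english'), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(predictions)
```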