How to Use TfidfVectorizer for Text in NLP
Use TfidfVectorizer from sklearn.feature_extraction.text to convert text data into numerical features by calculating term frequency-inverse document frequency (TF-IDF). Fit the vectorizer on your text data with fit_transform() to get a matrix of TF-IDF features ready for machine learning models.
Syntax
The TfidfVectorizer is initialized with optional parameters to control text processing. Use fit() to learn vocabulary and IDF from training data, and transform() to convert new text. fit_transform() does both in one step.
- TfidfVectorizer(): Creates the vectorizer object.
- fit(texts): Learns vocabulary and IDF from texts.
- transform(texts): Converts texts to TF-IDF features.
- fit_transform(texts): Fits and transforms texts in one step.
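```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder inputs so the snippet runs on its own
texts = ['I love machine learning', 'Machine learning is fun']
new_texts = ['I love coding in Python']

vectorizer = TfidfVectorizer(
    stop_words='english',  # remove common English words
    max_features=1000,     # limit to top 1000 features
    ngram_range=(1, 2)     # use unigrams and bigrams
)

# Fit and transform training texts
X = vectorizer.fit_transform(texts)

# Transform new texts with the vocabulary learned above
X_new = vectorizer.transform(new_texts)
```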
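To make the relationship between the three methods concrete, here is a minimal sketch. The sample sentences and variable names are made up for illustration. It checks that fit() followed by transform() gives the same matrix as fit_transform(), and that transform() keeps the column count fixed while ignoring words it never saw during fitting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample data, chosen only for illustration
train_texts = ["the cat sat on the mat", "the dog chased the cat"]
new_texts = ["a zebra chased the dog"]

# Two-step: learn vocabulary and IDF, then convert
two_step = TfidfVectorizer()
two_step.fit(train_texts)
X_two_step = two_step.transform(train_texts)

# One-step equivalent
one_step = TfidfVectorizer()
X_one_step = one_step.fit_transform(train_texts)

same_result = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print("Same matrix:", same_result)

# transform() reuses the learned vocabulary, so the column count stays fixed
X_new = two_step.transform(new_texts)
print("Train shape:", X_two_step.shape)
print("New shape:", X_new.shape)
print("'zebra' in vocabulary:", "zebra" in two_step.vocabulary_)
```

Words that appear only in the new text, such as "zebra" here, contribute nothing to the new row because they have no column in the fitted vocabulary.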
Example
This example shows how to convert a list of text documents into a TF-IDF feature matrix using TfidfVectorizer. It prints the feature names and the TF-IDF values for each document.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('TF-IDF matrix:\n', X.toarray())
```
Output
Feature names: ['coding' 'fun' 'learning' 'love' 'machine' 'python']
TF-IDF matrix:
[[0. 0. 0.57735027 0.57735027 0.57735027 0. ]
[0. 0.68091856 0.51785612 0. 0.51785612 0. ]
[0.62276601 0. 0. 0.4736296 0. 0.62276601]]
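To see where TF-IDF values come from, the sketch below recomputes them by hand for the same three documents and compares the result with TfidfVectorizer. It assumes scikit-learn's default settings: smoothed IDF, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python',
]

# Raw term counts with the same tokenization and stop-word settings
counts = CountVectorizer(stop_words='english').fit_transform(texts).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, as used when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight counts by IDF, then scale each row to unit L2 length (norm='l2')
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer(stop_words='english').fit_transform(texts).toarray()
matches = np.allclose(manual, reference)
print('Manual computation matches TfidfVectorizer:', matches)
```

Terms that occur in fewer documents get a larger IDF, so within a row they end up with a higher weight than terms shared across documents.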
Common Pitfalls
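- Leaving stop words in adds uninformative features. IDF down-weights very common words, but in a small corpus they can still receive noticeable weight.
- Using fit_transform on test data instead of transform refits the vocabulary and IDF weights on the test set. The resulting columns no longer match the training features, and test-set statistics leak into the features.
- Ignoring text preprocessing like lowercasing or punctuation removal may reduce quality. TfidfVectorizer lowercases text and drops punctuation by default, so this matters mainly when you override those defaults (for example with lowercase=False).
- Setting max_features too low can discard important words.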
```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts_train = ['I love machine learning', 'Machine learning is fun']
texts_test = ['I love coding']

# Wrong: calling fit_transform on test data refits the vocabulary and IDF
vectorizer_wrong = TfidfVectorizer()
X_train_wrong = vectorizer_wrong.fit_transform(texts_train)
X_test_wrong = vectorizer_wrong.fit_transform(texts_test)  # wrong

# Right: fit on train, transform on test
vectorizer_right = TfidfVectorizer()
X_train_right = vectorizer_right.fit_transform(texts_train)
X_test_right = vectorizer_right.transform(texts_test)  # correct
```
Quick Reference
Here is a quick summary of key TfidfVectorizer parameters and methods:
| Parameter / Method | Description |
|---|---|
| stop_words | Remove common words like 'the', 'and' (e.g., 'english') |
| max_features | Keep only the top N terms, ranked by term frequency across the corpus |
| ngram_range | Tuple (min_n, max_n) to include n-grams, e.g., (1,2) for unigrams and bigrams |
| fit(texts) | Learn vocabulary and IDF from texts |
| transform(texts) | Convert texts to TF-IDF features using learned vocabulary |
| fit_transform(texts) | Fit and transform texts in one step |
| get_feature_names_out() | Get an array of feature names (words or n-grams) |
Key Takeaways
- Use TfidfVectorizer to convert text into numerical features based on word importance.
- Always fit the vectorizer on training data and transform test data to avoid data leakage.
- Remove stop words and consider n-grams to improve feature quality.
- Check feature names with get_feature_names_out() to understand what words are used.
- Adjust parameters like max_features and ngram_range to suit your dataset size and task.
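One possible way to enforce the fit-on-train, transform-on-test rule automatically is to wrap the vectorizer and a model in a scikit-learn pipeline. The sketch below is only an illustration: the toy texts, the labels, and the choice of LogisticRegression are assumptions made for this example, not a recommendation for any particular task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = about machine learning, 0 = anything else
train_texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python',
    'Cooking pasta is fun',
]
train_labels = [1, 1, 0, 0]
test_texts = ['Deep learning with machine models', 'Python coding tips']

# fit() fits the vectorizer on the training texts only;
# predict() calls transform() on new texts, never fit_transform()
model = make_pipeline(TfidfVectorizer(stop_words='english'), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print('Predictions:', predictions)
```

Because the pipeline owns the vectorizer, there is no separate place in the code where fit_transform could be called on test data by accident.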
