How to Use TfidfVectorizer in sklearn with Python
Use TfidfVectorizer from sklearn.feature_extraction.text to convert text documents into TF-IDF feature vectors. Initialize it, then call fit_transform() on your text data to get the TF-IDF matrix.
Syntax
The basic syntax to use TfidfVectorizer is:
- TfidfVectorizer(): Creates the vectorizer object with optional parameters.
- fit_transform(texts): Learns the vocabulary and IDF weights from the texts, then transforms those texts into TF-IDF features.
- transform(new_texts): Transforms new texts using the vocabulary and IDF weights already learned.
python
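from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder inputs: any iterable of strings works
texts = ["first training document", "second training document"]
new_texts = ["a new document"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)  # learn vocabulary and IDF, then transform
new_tfidf = vectorizer.transform(new_texts)     # reuse the learned vocabulary and IDF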
Example
This example shows how to convert a list of text documents into a TF-IDF matrix and print the feature names and matrix values.
python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "cats and dogs are friends"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

print("Feature names:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", tfidf_matrix.toarray())
Output
Feature names are returned in alphabetical order. The matrix values below are rounded to three decimals for readability; the script itself prints eight, and NumPy may wrap long rows.
Feature names: ['and' 'are' 'ate' 'cat' 'cats' 'dog' 'dogs' 'friends' 'homework' 'mat' 'my' 'on' 'sat' 'the']
TF-IDF matrix:
[[0.    0.    0.    0.398 0.    0.    0.    0.    0.    0.398 0.    0.398 0.398 0.605]
 [0.    0.    0.467 0.    0.    0.467 0.    0.    0.467 0.    0.467 0.    0.    0.355]
 [0.447 0.447 0.    0.    0.447 0.    0.447 0.447 0.    0.    0.    0.    0.    0.   ]]
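To see where such numbers come from, the following sketch recomputes the matrix by hand. It assumes the vectorizer's default settings (smoothed IDF, L2 row normalization, no sublinear term-frequency scaling). The variable names counts, doc_freq, idf, weighted, manual, and reference are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "cats and dogs are friends",
]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(texts).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF, as with smooth_idf=True

weighted = counts * idf                          # term frequency times IDF
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer().fit_transform(texts).toarray()
print(np.allclose(manual, reference))
```

If the defaults are as assumed, the final line should print True. The word "the" appears in two of the three documents, so its IDF is lower than that of words appearing in only one, although its count of two in the first document still gives it the largest weight in that row.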
Common Pitfalls
Common mistakes when using TfidfVectorizer:
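- Calling transform() before fit() or fit_transform() raises a NotFittedError, because no vocabulary or IDF weights have been learned yet.
- Passing a single raw string instead of a list of strings raises a ValueError; wrap a lone document in a list.
- Leaving stop words in can add noise; use stop_words='english' to remove common English words.
- Printing the sparse matrix directly lists only its stored entries; call toarray() to view a small matrix as a dense array, and keep large matrices sparse.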
python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["hello world", "machine learning"]
vectorizer = TfidfVectorizer()

# Wrong: transform before fit
# tfidf_matrix = vectorizer.transform(texts)  # This will raise an error

# Right way:
tfidf_matrix = vectorizer.fit_transform(texts)
print(tfidf_matrix.toarray())
Output
[[0.70710678 0.         0.         0.70710678]
 [0.         0.70710678 0.70710678 0.        ]]
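The pitfalls above can also be triggered deliberately to see what they look like. This is a minimal sketch; the names unfitted_error and bare_string_error are illustrative, and the exact error messages may vary between scikit-learn versions, so only the exception types are printed.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["hello world", "machine learning"]

# Pitfall: transform() on a vectorizer that has not been fitted
unfitted_error = None
try:
    TfidfVectorizer().transform(texts)
except NotFittedError as exc:
    unfitted_error = exc
print("Unfitted:", type(unfitted_error).__name__)

# Pitfall: a bare string instead of a list of strings
bare_string_error = None
try:
    TfidfVectorizer().fit_transform("hello world")
except ValueError as exc:
    bare_string_error = exc
print("Bare string:", type(bare_string_error).__name__)

# The sparse result can be inspected without converting it to a dense array
matrix = TfidfVectorizer().fit_transform(texts)
print(matrix.shape, matrix.nnz)
```

The shape and nnz (number of stored non-zero entries) attributes are often enough for a quick sanity check, which avoids building a dense array for a large corpus.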
Quick Reference
Key parameters of TfidfVectorizer:
- stop_words='english': Removes common English words.
- max_features=1000: Limits the vocabulary to the 1000 most frequent terms.
- ngram_range=(1,2): Includes unigrams and bigrams.
- max_df=0.8: Ignores words that appear in more than 80% of documents.
- min_df=2: Ignores words that appear in fewer than 2 documents.
Key Takeaways
Initialize TfidfVectorizer and call fit_transform on your text data to get TF-IDF features.
Use transform on new data only after fitting the vectorizer to training data.
Convert the sparse TF-IDF matrix to a dense array with toarray() when you want to view a small result; keep large matrices sparse to save memory.
Use parameters like stop_words and ngram_range to customize text processing.
Common errors include calling transform before fit and ignoring stop words.