How to Use TF-IDF in Python with sklearn
Use `TfidfVectorizer` from `sklearn.feature_extraction.text` to convert text data into TF-IDF features. Fit the vectorizer on your text data with `fit_transform()` to get the TF-IDF matrix representing word importance.

Syntax
The main class to use is TfidfVectorizer. You create an instance, then call fit_transform() on your list of text documents to get the TF-IDF matrix. You can also use transform() on new data after fitting.
- `TfidfVectorizer()`: Initializes the vectorizer with optional parameters like `stop_words` or `max_features`.
- `fit_transform(texts)`: Learns the vocabulary and idf, then transforms texts to TF-IDF features.
- `transform(texts)`: Transforms new texts using the learned vocabulary and idf.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(texts)  # texts is a list of strings

# To transform new texts after fitting:
new_tfidf = vectorizer.transform(new_texts)
```
Example
This example shows how to create a TF-IDF matrix from a small list of text documents. It prints the feature names and the TF-IDF values for each document.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs are friends'
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('TF-IDF matrix shape:', tfidf_matrix.shape)

# Convert sparse matrix to dense and print
print('TF-IDF matrix (dense):')
print(tfidf_matrix.todense())
```
Output
Feature names: ['cat' 'cats' 'dog' 'dogs' 'friends' 'log' 'mat' 'sat']
TF-IDF matrix shape: (3, 8)
TF-IDF matrix (dense):
[[0.62276601 0.         0.         0.         0.         0.
  0.62276601 0.4736296 ]
 [0.         0.         0.62276601 0.         0.         0.62276601
  0.         0.4736296 ]
 [0.         0.57735027 0.         0.57735027 0.57735027 0.
  0.         0.        ]]
Common Pitfalls
- Not removing stop words can cause common words like "the" or "and" to dominate the features.
- Using `fit_transform()` on training data but forgetting to use `transform()` on test data leads to inconsistent features.
- Not converting the sparse matrix to dense before printing can cause confusing output.
- Setting `max_features` too low may exclude important words.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts_train = ['apple orange banana', 'banana fruit apple']
texts_test = ['apple fruit']

# Wrong: fitting separately on train and test
vectorizer_wrong = TfidfVectorizer()
train_matrix_wrong = vectorizer_wrong.fit_transform(texts_train)
test_matrix_wrong = vectorizer_wrong.fit_transform(texts_test)  # Wrong: should use transform()

# Right: fit on train, transform on test
vectorizer_right = TfidfVectorizer()
train_matrix_right = vectorizer_right.fit_transform(texts_train)
test_matrix_right = vectorizer_right.transform(texts_test)
```
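To see concretely why the "wrong" version breaks, this sketch (reusing the toy data above) compares the vocabularies the two vectorizers learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts_train = ['apple orange banana', 'banana fruit apple']
texts_test = ['apple fruit']

# Fitting a fresh vectorizer on the test texts learns a different,
# smaller vocabulary than the one learned from the training texts.
vec_train = TfidfVectorizer().fit(texts_train)
vec_test = TfidfVectorizer().fit(texts_test)

print(sorted(vec_train.vocabulary_))  # ['apple', 'banana', 'fruit', 'orange']
print(sorted(vec_test.vocabulary_))   # ['apple', 'fruit']

# The resulting matrices have different widths, so a model trained on
# one feature space cannot consume the other.
print(vec_train.transform(texts_test).shape)  # (1, 4)
print(vec_test.transform(texts_test).shape)   # (1, 2)
```

A model trained on the 4-column training matrix would raise a shape error (or silently misinterpret columns) if fed the 2-column test matrix.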
Quick Reference
Remember these key points when using TF-IDF in Python:
- Use `TfidfVectorizer` to convert text to TF-IDF features.
- Call `fit_transform()` on training data and `transform()` on new data.
- Use `stop_words='english'` to remove common words.
- Access feature names with `get_feature_names_out()`.
- TF-IDF output is a sparse matrix; convert to dense with `todense()` if needed.
Key Takeaways
Use sklearn's TfidfVectorizer to convert text data into TF-IDF features easily.
Always fit the vectorizer on training data and transform new data with the same vectorizer.
Remove stop words to focus on meaningful words using the stop_words parameter.
TF-IDF output is a sparse matrix; convert it to dense for easier inspection if needed.
Avoid fitting the vectorizer separately on train and test data to keep consistent features.