Bag of Words and TF-IDF turn text into numbers so that computers can work with it. Once text is numeric, machine learning models can find patterns in it.
Bag of Words and TF-IDF in ML Python
Introduction
These techniques are useful when you want to:
Classify emails as spam or not spam.
Find the main topics in customer reviews.
Compare documents to see how similar they are.
Build a search engine that finds relevant documents.
Analyze social media posts to understand trends.
Syntax
ML Python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# documents is a list of strings, one string per document

# Bag of Words
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(documents)

# TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(documents)
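CountVectorizer creates a matrix with one row per document and one column per word. Each cell holds the number of times that word appears in that document.
TfidfVectorizer creates a matrix of the same shape, but each count is replaced by a weight. The weight is higher when a word appears often in a document and lower when the word appears in many documents.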
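To make the counting concrete, here is a minimal sketch in plain Python of the idea behind a count matrix. It is not scikit-learn's implementation. The `tokenize` and `bag_of_words` helpers are hypothetical names, and the tokenizer (lowercase, words of two or more characters) is an assumption chosen to resemble the library defaults.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase, keep tokens of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct word, sorted alphabetically.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row describes one document and each column one word, which is the same layout that `X.toarray()` prints in the examples.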
Examples
This example shows Bag of Words counting words in two sentences.
ML Python
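from sklearn.feature_extraction.text import CountVectorizer

docs = ['I love apples', 'You love oranges']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary learned from the documents, in alphabetical order
print(vectorizer.get_feature_names_out())
# ['apples' 'love' 'oranges' 'you']

# One row per sentence, one column per vocabulary word
print(X.toarray())
# [[1 1 0 0]
#  [0 1 1 1]]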
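Running this example gives a vocabulary that does not contain the word 'I'. The reason is the default `token_pattern` of CountVectorizer, which keeps only tokens with two or more word characters, applied after lowercasing. The sketch below applies that same regular expression with Python's `re` module; `default_tokens` is a hypothetical helper name used only for illustration.

```python
import re

# Default token_pattern documented for CountVectorizer:
# a token is two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_tokens(text):
    # CountVectorizer lowercases text by default before tokenizing.
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

print(default_tokens("I love apples"))     # ['love', 'apples']
print(default_tokens("You love oranges"))  # ['you', 'love', 'oranges']
```

If single-letter words matter for your task, you can pass your own pattern, for example `CountVectorizer(token_pattern=r"(?u)\b\w+\b")`.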
This example shows TF-IDF values for the same two sentences. The word 'love' appears in both sentences, so it gets a lower weight than the words that appear in only one of them.
ML Python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['I love apples', 'You love oranges']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
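The weights can also be worked out by hand. The sketch below assumes the default settings documented for TfidfVectorizer: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. It starts from the word counts of the two sentences and uses only the standard library, so treat it as an illustration of the formula rather than the library's code.

```python
import math

# Raw counts for the two sentences; columns are apples, love, oranges, you.
vocabulary = ["apples", "love", "oranges", "you"]
counts = [
    [1, 1, 0, 0],  # 'I love apples' ('I' is dropped by the default tokenizer)
    [0, 1, 1, 1],  # 'You love oranges'
]

n_docs = len(counts)
# Document frequency: number of documents that contain each word.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    # Scale the row to unit (L2) length.
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([round(v / norm, 4) for v in weighted])

for row in tfidf:
    print(row)
# [0.8148, 0.5797, 0.0, 0.0]
# [0.0, 0.4494, 0.6317, 0.6317]
```

In both rows 'love' has the smallest non-zero weight, because it is the only word that occurs in every document.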
Sample Model
This program shows how to convert text documents into numbers using Bag of Words and TF-IDF. It prints the words found and the matrices representing the documents.
ML Python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample documents
documents = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python',
    'Python coding is great'
]

# Bag of Words
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(documents)
print('Bag of Words feature names:', count_vectorizer.get_feature_names_out())
print('Bag of Words matrix:\n', X_counts.toarray())

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
print('\nTF-IDF feature names:', tfidf_vectorizer.get_feature_names_out())
print('TF-IDF matrix:\n', X_tfidf.toarray())
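Once documents are vectors, you can compare them. The sketch below measures cosine similarity between raw word counts of the sample documents, using only the standard library. The `count_vector` and `cosine_similarity` helpers are hypothetical names, and the tokenizer is an assumption that resembles the scikit-learn default.

```python
import math
import re
from collections import Counter

documents = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
    "Python coding is great",
]

def count_vector(text):
    # Assumed tokenizer: lowercase, tokens of two or more word characters.
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine_similarity(a, b):
    # Dot product over shared words divided by the product of vector lengths.
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

vectors = [count_vector(doc) for doc in documents]
similarity_ml = cosine_similarity(vectors[0], vectors[1])     # share 'machine', 'learning'
similarity_mixed = cosine_similarity(vectors[0], vectors[3])  # no shared words
print(round(similarity_ml, 3))     # 0.577
print(round(similarity_mixed, 3))  # 0.0
```

The two machine learning sentences score higher than a pair with no words in common, which is the basis for document comparison and simple search.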
Important Notes
Bag of Words counts words but ignores word order and meaning.
TF-IDF lowers the weight of words that appear in many documents, such as 'is' or 'the'.
Both methods turn text into numbers so machine learning models can use them.
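The first note is easy to check. In this minimal sketch, two sentences with opposite meanings produce identical word counts. The `word_counts` helper is a hypothetical name, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def word_counts(text):
    # Simplified tokenizer (an assumption): lowercase and split on whitespace.
    return Counter(text.lower().split())

a = word_counts("dog bites man")
b = word_counts("man bites dog")
print(a == b)  # True: same words, same counts, different meaning
```

A model that sees only these counts cannot tell the two sentences apart.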
Summary
Bag of Words counts how often words appear in text.
TF-IDF weights words by how often they appear in a document and how rare they are across all documents.
Both help computers understand and work with text data.