What are Bag of Words and TF-IDF in ML Python?

Bag of Words and TF-IDF in ML Python

Introduction

Bag of Words and TF-IDF turn text into numeric vectors that machine learning models can work with, which makes it possible to find patterns in text. Typical situations where they help:

When you want to classify emails as spam or not spam.

When you want to find the main topics in customer reviews.

When you want to compare documents to see how similar they are.

When you want to build a search engine that finds relevant documents.

When you want to analyze social media posts to understand trends.

Syntax

ML Python

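from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Any iterable of strings works; these two are placeholders
documents = ['first example text', 'second example text']

# Bag of Words: raw word counts per document
count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(documents)

# TF-IDF: counts reweighted by how rare each word is across documents
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(documents)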

CountVectorizer creates a matrix with one row per document and one column per word. Each entry counts how many times that word appears in that document.

TfidfVectorizer creates a matrix with the same row-per-document, column-per-word layout, but reweights each count: a word that appears in many documents gets a lower weight than a word concentrated in a few.

Examples

This example shows Bag of Words counting words in two sentences.

ML Python

from sklearn.feature_extraction.text import CountVectorizer

docs = ['I love apples', 'You love oranges']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())

This example shows TF-IDF values for the same two sentences. Words that appear in only one sentence get higher values than 'love', which appears in both.

ML Python

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['I love apples', 'You love oranges']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())

Sample Model

This program shows how to convert text documents into numbers using Bag of Words and TF-IDF. It prints the words found and the matrices representing the documents.

ML Python

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample documents
documents = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python',
    'Python coding is great'
]

# Bag of Words
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(documents)
print('Bag of Words feature names:', count_vectorizer.get_feature_names_out())
print('Bag of Words matrix:\n', X_counts.toarray())

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
print('\nTF-IDF feature names:', tfidf_vectorizer.get_feature_names_out())
print('TF-IDF matrix:\n', X_tfidf.toarray())

Important Notes

Bag of Words counts words but ignores word order and meaning.

TF-IDF reduces the weight of words that appear in most documents, such as 'is' or 'the'.

Both methods turn text into numbers so machine learning models can use them.
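
The first note above, that word order is ignored, can be seen directly. In this small sketch, two made-up sentences with opposite meanings use the same words and therefore get identical Bag of Words vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with the same words in a different order
docs = ['dog bites man', 'man bites dog']

X = CountVectorizer().fit_transform(docs).toarray()
print(X)
# [[1 1 1]
#  [1 1 1]]
print((X[0] == X[1]).all())  # True
```

If word order matters for a task, one option is the ngram_range parameter of CountVectorizer, which adds sequences of adjacent words as extra features.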

Summary

Bag of Words counts how often words appear in text.

TF-IDF weights words by how often they appear in a document and how rare they are across all documents.

Both help computers understand and work with text data.
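
As an illustration of feeding these numbers to a model, the sketch below chains TfidfVectorizer with a Naive Bayes classifier for the spam example mentioned in the introduction. The four training texts, their labels, and the choice of MultinomialNB are assumptions made for this sketch; a real classifier would need far more data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset, for illustration only
texts = [
    'win a free prize now',
    'free money claim your prize',
    'meeting agenda for monday',
    'lunch with the project team',
]
labels = ['spam', 'spam', 'ham', 'ham']

# The pipeline vectorizes the text, then passes the matrix to the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict(['claim your free prize', 'project meeting on monday'])
print(predictions)
# ['spam' 'ham']
```

Wrapping both steps in one pipeline means new text is vectorized with the vocabulary learned during fitting, so the same transformation applies to training and prediction.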