How to Use Bag of Words with sklearn in Python

Use CountVectorizer from sklearn.feature_extraction.text to convert text data into a Bag of Words numeric matrix. Fit the vectorizer on your text data with fit_transform() to get word counts for each document.

📐

Syntax

The main class to create a Bag of Words model in sklearn is CountVectorizer. You initialize it, then use fit_transform() on your list of text documents to get a matrix of word counts.

CountVectorizer(): creates the vectorizer object.
fit_transform(texts): learns the vocabulary and returns the word count matrix.
get_feature_names_out(): returns the list of words (features) learned.

python

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(['sample text data'])
words = vectorizer.get_feature_names_out()

💻

Example

This example shows how to convert a list of sentences into a Bag of Words matrix using CountVectorizer. It prints the feature words and the matrix as an array.

python

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print('Feature words:', vectorizer.get_feature_names_out())
print('Bag of Words matrix:\n', X.toarray())

Output

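Feature words: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Bag of Words matrix:
 [[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]

The word "I" does not appear in the vocabulary because the default token_pattern keeps only tokens of two or more characters.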

⚠️

Common Pitfalls

Common mistakes when using Bag of Words with sklearn include:

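Printing the sparse matrix directly instead of calling toarray() first, which lists the stored (row, column) entries rather than a readable table of counts.
Forgetting that CountVectorizer lowercases all text by default, so "Python" and "python" are counted as the same word.
Leaving stop words in, which fills the vocabulary with common words (such as "the" or "is") that often add little signal to a model.
Passing a single raw string instead of a list of strings to fit_transform(), which raises a ValueError.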

python

from sklearn.feature_extraction.text import CountVectorizer

texts = 'This is a single string, not a list.'

# Wrong: passing a bare string instead of a list
# vectorizer = CountVectorizer()
# X = vectorizer.fit_transform(texts)  # Raises ValueError: an iterable of documents is expected, not a string

# Right:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([texts])  # Pass a list of strings
print('Features:', vectorizer.get_feature_names_out())

Output

Features: ['is' 'list' 'not' 'single' 'string' 'this']

📊

Quick Reference

Summary tips for using Bag of Words with sklearn:

Use CountVectorizer() to create the model.
Call fit_transform() on a list of text documents.
Use get_feature_names_out() to see the vocabulary.
Convert the sparse output matrix to a dense array with toarray() for easy viewing.
Consider parameters like stop_words='english' to remove common words.

✅

Key Takeaways

Use sklearn's CountVectorizer to convert text into a numeric Bag of Words matrix.

Always pass a list of strings to fit_transform, not a single string.

Use get_feature_names_out() to get the list of words learned.

Convert the sparse matrix to an array with toarray() to view counts clearly.

Consider removing stop words (for example with stop_words='english') to cut common words that often add little to a model.