
How to Use Bag of Words with sklearn in Python

Use CountVectorizer from sklearn.feature_extraction.text to convert text data into a Bag of Words numeric matrix. Fit the vectorizer on your text data with fit_transform() to get word counts for each document.

Syntax

The main class to create a Bag of Words model in sklearn is CountVectorizer. You initialize it, then use fit_transform() on your list of text documents to get a matrix of word counts.

  • CountVectorizer(): creates the vectorizer object.
  • fit_transform(texts): learns the vocabulary and returns the word count matrix, with one row per document and one column per word.
  • get_feature_names_out(): returns the list of words (features) learned.
python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(['sample text data'])
words = vectorizer.get_feature_names_out()

Example

This example shows how to convert a list of sentences into a Bag of Words matrix using CountVectorizer. It prints the feature words and the matrix as an array.

python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print('Feature words:', vectorizer.get_feature_names_out())
print('Bag of Words matrix:\n', X.toarray())
Output
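Feature words: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Bag of Words matrix:
 [[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]

The word "I" is missing from the vocabulary because CountVectorizer's default token_pattern only keeps tokens of two or more characters.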

Common Pitfalls

Common mistakes when using Bag of Words with sklearn include:

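  • Printing the sparse matrix directly instead of calling toarray() first; the output lists coordinates and counts rather than a readable table.
  • Forgetting that CountVectorizer lowercases all text by default (lowercase=True), so "Python" and "python" are counted as the same word.
  • Leaving stop words in, which fills the vocabulary with common words (such as "is" or "the") that rarely help your model.
  • Passing a single string instead of a list of strings to fit_transform(), which raises a ValueError.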
python
from sklearn.feature_extraction.text import CountVectorizer

texts = 'This is a single string, not a list.'

# Wrong: passing a string instead of a list
# vectorizer = CountVectorizer()
# X = vectorizer.fit_transform(texts)  # Raises ValueError: an iterable of documents is expected, not a string

# Right:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([texts])  # Pass a list of strings
print('Features:', vectorizer.get_feature_names_out())
Output
Features: ['is' 'list' 'not' 'single' 'string' 'this']

Quick Reference

Summary tips for using Bag of Words with sklearn:

  • Use CountVectorizer() to create the model.
  • Call fit_transform() on a list of text documents.
  • Use get_feature_names_out() to see the vocabulary.
  • Convert the output sparse matrix to array with toarray() for easy viewing.
  • Consider parameters like stop_words='english' to remove common words.

Key Takeaways

  • Use sklearn's CountVectorizer to convert text into a numeric Bag of Words matrix.
  • Always pass a list of strings to fit_transform(), not a single string.
  • Use get_feature_names_out() to get the list of words learned.
  • Convert the sparse matrix to an array with toarray() to view counts clearly.
  • Consider removing stop words to improve model quality.