How to Use Bag of Words with sklearn in Python
Use CountVectorizer from sklearn.feature_extraction.text to convert text data into a Bag of Words numeric matrix. Fit the vectorizer on your text data with fit_transform() to get word counts for each document.
Syntax
The main class for building a Bag of Words model in sklearn is CountVectorizer. You initialize it, then call fit_transform() on a list of text documents to get a sparse matrix of word counts, with one row per document and one column per vocabulary word.
- CountVectorizer(): creates the vectorizer object.
- fit_transform(texts): learns the vocabulary and returns the word count matrix.
- get_feature_names_out(): returns the list of words (features) learned.
python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(['sample text data'])
words = vectorizer.get_feature_names_out()
Example
This example shows how to convert a list of sentences into a Bag of Words matrix using CountVectorizer. It prints the feature words and the matrix as an array.
python
from sklearn.feature_extraction.text import CountVectorizer
texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print('Feature words:', vectorizer.get_feature_names_out())
print('Bag of Words matrix:\n', X.toarray())
Output
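Feature words: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Bag of Words matrix:
 [[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]
The word 'i' does not appear in the vocabulary: the default token_pattern of CountVectorizer only keeps tokens of two or more alphanumeric characters, so single-character words are dropped.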
Common Pitfalls
Common mistakes when using Bag of Words with sklearn include:
- Printing the sparse matrix directly, which shows a coordinate listing of non-zero entries instead of a readable table. Call toarray() first.
- Forgetting that CountVectorizer lowercases all text by default, so 'Machine' and 'machine' are counted as the same word.
- Not removing stop words, which can add many common words that don't help your model.
- Passing a single raw string instead of a list of strings to fit_transform(), which raises a ValueError.
python
from sklearn.feature_extraction.text import CountVectorizer
texts = 'This is a single string, not a list.'
# Wrong: passing a string instead of a list
# vectorizer = CountVectorizer()
# X = vectorizer.fit_transform(texts)  # Raises ValueError: an iterable of documents is expected
# Right: wrap the string in a list
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([texts])  # Pass a list of strings
print('Features:', vectorizer.get_feature_names_out())
Output
Features: ['is' 'list' 'not' 'single' 'string' 'this']
Quick Reference
Summary tips for using Bag of Words with sklearn:
- Use CountVectorizer() to create the model.
- Call fit_transform() on a list of text documents.
- Use get_feature_names_out() to see the vocabulary.
- Convert the output sparse matrix to an array with toarray() for easy viewing.
- Consider parameters like stop_words='english' to remove common words.
Key Takeaways
Use sklearn's CountVectorizer to convert text into a numeric Bag of Words matrix.
Always pass a list of strings to fit_transform, not a single string.
Use get_feature_names_out() to get the list of words learned.
Convert the sparse matrix to an array with toarray() to view counts clearly.
Consider removing stop words so that common, uninformative words do not fill the vocabulary.