
How to Use CountVectorizer in sklearn with Python

Use CountVectorizer from sklearn.feature_extraction.text to convert text into a matrix of token counts. First, create a CountVectorizer object, then call fit_transform() on your text data to get the numeric features.

Syntax

The basic syntax to use CountVectorizer is:

  • CountVectorizer(): Creates the vectorizer object with optional parameters like stop_words or max_features.
  • fit_transform(texts): Learns the vocabulary from the list of texts and transforms them into a matrix of token counts.
  • get_feature_names_out(): Returns the list of tokens (words) learned.
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(texts)
features = vectorizer.get_feature_names_out()
```

Example

This example shows how to convert a list of text documents into a matrix of word counts using CountVectorizer. It also prints the feature names and the resulting matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('Count matrix:\n', X.toarray())
```
Output

```
Feature names: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Count matrix:
 [[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]
```

Note that 'I' does not appear as a feature: the default tokenizer lowercases the text and ignores single-character tokens.

Common Pitfalls

Common mistakes when using CountVectorizer include:

  • Not converting the sparse matrix to an array before printing, which can be confusing.
  • Ignoring stop words that may skew results; use stop_words='english' to remove common words.
  • Not fitting the vectorizer before transforming new data, which causes errors.
  • Assuming the order of features is the same across different vectorizers or datasets.
```python
from sklearn.feature_extraction.text import CountVectorizer

texts_train = ['I love AI', 'AI is the future']
texts_test = ['I love the future']

vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(texts_train)

# Wrong: a fresh, unfitted vectorizer has no vocabulary
# X_test = CountVectorizer().transform(texts_test)  # raises NotFittedError

# Right: reuse the same fitted vectorizer
X_test = vectorizer.transform(texts_test)

print(X_test.toarray())
```
Output

```
[[0 1 1]]
```

The learned vocabulary is `['ai', 'future', 'love']` ('is', 'the', and 'I' are English stop words), so the test sentence counts 'future' and 'love' but not 'ai'.

Quick Reference

| Method/Parameter | Description |
| --- | --- |
| CountVectorizer() | Create vectorizer with optional parameters like stop_words, max_features |
| fit_transform(texts) | Learn vocabulary and transform texts to count matrix |
| transform(new_texts) | Transform new texts using learned vocabulary |
| get_feature_names_out() | Get list of tokens (words) learned |
| stop_words='english' | Remove common English stop words |
| max_features=1000 | Limit vocabulary size to top 1000 tokens |

Key Takeaways

  • Create a CountVectorizer object and call fit_transform() on your text data to get token counts.
  • Use get_feature_names_out() to see which word corresponds to each column of the output matrix.
  • Always fit the vectorizer on training data and reuse that fitted vectorizer to transform new data; an unfitted vectorizer raises an error.
  • Consider removing stop words with stop_words='english' to improve feature quality.
  • The output is a sparse matrix; convert it with toarray() for easy viewing.