
How to Use CountVectorizer for Text in NLP

Use CountVectorizer from sklearn.feature_extraction.text to convert text documents into a matrix of token counts. It lowercases and tokenizes each document, builds a vocabulary from the tokens it finds, and counts how often each token occurs in each document. The result is a numeric feature matrix that machine learning models can consume.

Syntax

The basic syntax to use CountVectorizer is:

  • CountVectorizer(): Creates the vectorizer object.
  • fit(texts): Learns the vocabulary from the list of text documents.
  • transform(texts): Converts texts into a matrix of token counts, using the vocabulary learned by fit.
  • fit_transform(texts): Combines fit and transform in one step.

You can customize tokenization, stop words, and other options when creating the vectorizer.

python
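from sklearn.feature_extraction.text import CountVectorizer

# Any list (or other iterable) of strings works as input
texts = ['first example document', 'second example document']

vectorizer = CountVectorizer()
vectorizer.fit(texts)  # Learn vocabulary
matrix = vectorizer.transform(texts)  # Transform texts to count matrix

# Or combine both steps:
matrix = vectorizer.fit_transform(texts)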

Example

This example shows how to convert a list of sentences into a count matrix and view the feature names and counts.

python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('Count matrix as array:\n', count_matrix.toarray())
Output
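Feature names: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Count matrix as array:
 [[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]

The feature names are sorted alphabetically. The word 'I' does not appear because the default token pattern only keeps tokens of two or more characters.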
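
With default settings, single-character tokens such as 'I' are not part of the vocabulary, because the default token_pattern is r"(?u)\b\w\w+\b". If such tokens matter for your task, one possible approach is to pass a more permissive pattern. The sketch below is an illustration with made-up variable names, not the only way to do it:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['I love machine learning', 'I love coding in Python']

# Default pattern requires at least two word characters per token
default_vec = CountVectorizer()
default_vec.fit(texts)
has_i_default = 'i' in default_vec.vocabulary_

# Relaxed pattern: one or more word characters, so 'i' is kept
relaxed_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
relaxed_vec.fit(texts)
has_i_relaxed = 'i' in relaxed_vec.vocabulary_

print("'i' in default vocabulary:", has_i_default)
print("'i' in relaxed vocabulary:", has_i_relaxed)
```

Here vocabulary_ is the fitted dictionary that maps each token to its column index in the count matrix.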

Common Pitfalls

Common mistakes when using CountVectorizer include:

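  • Calling transform before the vectorizer has been fitted, or refitting on new data instead of reusing the fitted vocabulary.
  • Forgetting that text is lowercased by default, so 'AI' and 'ai' are counted as the same token.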
  • Not handling stop words, which can add noise.
  • Assuming the output is dense; it returns a sparse matrix to save memory.

Always check the vocabulary and understand the output format.

python
from sklearn.feature_extraction.text import CountVectorizer

texts_train = ['I love AI', 'AI is the future']
texts_test = ['I love the future']

vectorizer = CountVectorizer()
vectorizer.fit(texts_train)  # Fit on training data only

# Wrong: fitting again on test data (loses training vocab)
# vectorizer.fit(texts_test)

# Correct: transform test data using trained vectorizer
test_matrix = vectorizer.transform(texts_test)

print('Vocabulary:', vectorizer.get_feature_names_out())
print('Test matrix:\n', test_matrix.toarray())
Output
Vocabulary: ['ai' 'future' 'is' 'love' 'the']
Test matrix:
 [[0 1 0 1 1]]
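
The other pitfalls can be checked in the same way. The following sketch (variable names are illustrative) shows that the output is a SciPy sparse matrix, that words unseen during fit are silently ignored by transform, and that transforming with an unfitted vectorizer raises NotFittedError:

```python
from scipy.sparse import issparse
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

texts_train = ['I love AI', 'AI is the future']

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(texts_train)

# The result is sparse; call toarray() only for small data or inspection
is_sparse = issparse(train_matrix)
print('Sparse output:', is_sparse, '| shape:', train_matrix.shape)

# None of these words were seen during fit, so every count is zero
unseen_matrix = vectorizer.transform(['Robots are cool'])
print('Unseen words:', unseen_matrix.toarray())

# Transforming with an unfitted vectorizer fails
try:
    CountVectorizer().transform(['some text'])
    raised = False
except NotFittedError:
    raised = True
print('Unfitted transform raised NotFittedError:', raised)
```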

Quick Reference

Parameter    | Description                                                          | Default
stop_words   | Remove common words like 'the', 'is'                                 | None
lowercase    | Convert all text to lowercase before tokenizing                      | True
max_features | Limit vocabulary size to the most frequent terms                     | None
ngram_range  | Range of n-grams to extract (e.g., (1, 2))                           | (1, 1)
max_df       | Ignore terms that appear in more than this share or number of documents  | 1.0
min_df       | Ignore terms that appear in fewer than this share or number of documents | 1

Key Takeaways

  • CountVectorizer converts text into a matrix of word counts for machine learning.
  • Always fit the vectorizer on training data before transforming new data.
  • Use parameters like stop_words and lowercase to clean and control features.
  • The output is a sparse matrix; convert to array only if needed for inspection.
  • Check vocabulary with get_feature_names_out() to understand extracted tokens.