
How to Use CountVectorizer for Text in NLP

Use CountVectorizer from sklearn.feature_extraction.text to convert text documents into a matrix of token counts. By default it lowercases the text, splits it into word tokens of two or more characters, counts how often each token occurs in each document, and returns a numeric feature matrix for machine learning models.

📐

Syntax

The basic syntax to use CountVectorizer is:

CountVectorizer(): Creates the vectorizer object.
fit(texts): Learns the vocabulary from the list of text documents.
transform(texts): Converts texts into a matrix of token counts.
fit_transform(texts): Combines fit and transform in one step.

You can customize tokenization, stop words, and other options when creating the vectorizer.

python

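from sklearn.feature_extraction.text import CountVectorizer

# Placeholder documents; replace with your own list of strings
texts = ['first example document', 'second example document']

vectorizer = CountVectorizer()
vectorizer.fit(texts)  # Learn vocabulary
matrix = vectorizer.transform(texts)  # Transform texts to count matrix

# Or combine both steps:
matrix = vectorizer.fit_transform(texts)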

💻

Example

This example shows how to convert a list of sentences into a count matrix and view the feature names and counts.

python

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('Count matrix as array:\n', count_matrix.toarray())

Output

Feature names: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Count matrix as array:
 [[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]
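
The default token pattern, `r"(?u)\b\w\w+\b"`, keeps only tokens with two or more word characters, so a one-letter word such as "I" never enters the vocabulary. If single-character tokens matter for your task, one possible approach is to pass a looser `token_pattern`. The sketch below reuses the example sentences; the variable names `default_vec` and `loose_vec` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
]

# Default pattern: tokens need at least two word characters.
default_vec = CountVectorizer()
default_vec.fit(texts)

# Looser pattern: also keeps single-character tokens such as "i".
loose_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
loose_vec.fit(texts)

print("i" in default_vec.vocabulary_)  # False
print("i" in loose_vec.vocabulary_)    # True
```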

⚠️

Common Pitfalls

Common mistakes when using CountVectorizer include:

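Calling transform before fit, or re-fitting on new data instead of reusing the vectorizer fitted on the training data.
Forgetting that text is lowercased by default; set lowercase=False if case matters.
Leaving stop words in (the default is stop_words=None), which can add noise.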
Assuming the output is dense; it returns a sparse matrix to save memory.

Always check the vocabulary and understand the output format.

python

from sklearn.feature_extraction.text import CountVectorizer

texts_train = ['I love AI', 'AI is the future']
texts_test = ['I love the future']

vectorizer = CountVectorizer()
vectorizer.fit(texts_train)  # Fit on training data only

# Wrong: fitting again on test data (loses training vocab)
# vectorizer.fit(texts_test)

# Correct: transform test data using trained vectorizer
test_matrix = vectorizer.transform(texts_test)

print('Vocabulary:', vectorizer.get_feature_names_out())
print('Test matrix:\n', test_matrix.toarray())

Output

Vocabulary: ['ai' 'future' 'is' 'love' 'the']
Test matrix:
 [[0 1 0 1 1]]
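
The other pitfalls from the list can be checked in a few lines. The following is a rough sketch, reusing the training sentences as `docs`; it shows that transforming with an unfitted vectorizer raises `NotFittedError`, that the result is a SciPy sparse matrix, and that `lowercase=False` treats differently cased words as separate tokens.

```python
from scipy.sparse import issparse
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love AI", "AI is the future"]

# Pitfall: transform before fit raises NotFittedError.
unfitted = CountVectorizer()
try:
    unfitted.transform(docs)
    raised = False
except NotFittedError:
    raised = True
print("transform before fit failed:", raised)

# Pitfall: the result is a sparse matrix, not a NumPy array.
matrix = CountVectorizer().fit_transform(docs)
print("sparse:", issparse(matrix))
dense = matrix.toarray()  # convert only for inspection
print("dense shape:", dense.shape)

# Pitfall: with lowercase=False, "AI", "Ai" and "ai" are distinct tokens.
cased = CountVectorizer(lowercase=False).fit(["AI Ai ai"])
print(sorted(cased.vocabulary_))
```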

📊

Quick Reference

Parameter	Description	Default
stop_words	Remove common words like 'the', 'is'	None
lowercase	Convert all text to lowercase	True
max_features	Limit vocabulary size	None
ngram_range	Range of n-grams to extract (e.g., (1,2))	(1,1)
max_df	Ignore terms that appear in more than this share (float) or number (int) of documents	1.0
min_df	Ignore terms that appear in fewer than this share (float) or number (int) of documents	1

✅

Key Takeaways

CountVectorizer converts text into a matrix of word counts for machine learning.

Always fit the vectorizer on training data before transforming new data.

Use parameters like stop_words and lowercase to clean and control features.

The output is a sparse matrix; convert to array only if needed for inspection.

Check vocabulary with get_feature_names_out() to understand extracted tokens.