How to Use CountVectorizer for Text in NLP
Use CountVectorizer from sklearn.feature_extraction.text to convert text data into a matrix of token counts. It splits each document into word tokens, counts how often each token occurs in that document, and produces a numeric feature matrix for machine learning models.
Syntax
The basic syntax to use CountVectorizer is:
- CountVectorizer(): Creates the vectorizer object.
- fit(texts): Learns the vocabulary from the list of text documents.
- transform(texts): Converts texts into a matrix of token counts.
- fit_transform(texts): Combines fit and transform in one step.
You can customize tokenization, stop words, and other options when creating the vectorizer.
```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['first document', 'second document']  # any list of strings

vectorizer = CountVectorizer()
vectorizer.fit(texts)                  # Learn vocabulary
matrix = vectorizer.transform(texts)   # Transform texts to count matrix

# Or combine both steps:
matrix = vectorizer.fit_transform(texts)
```
Example
This example shows how to convert a list of sentences into a count matrix and view the feature names and counts.
```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('Count matrix as array:\n', count_matrix.toarray())
```
Output
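Feature names: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Count matrix as array:
 [[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]
Each row is one sentence and each column is one vocabulary word, in alphabetical order. Note that 'I' does not appear in the vocabulary: the default token pattern keeps only tokens of two or more alphanumeric characters, so single-letter words are dropped.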
Common Pitfalls
Common mistakes when using CountVectorizer include:
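- Calling transform before fit, or refitting on new data instead of reusing the vocabulary learned from the training data.
- Forgetting that text is lowercased by default (lowercase=True), so 'Python' and 'python' become the same token.
- Not handling stop words, which can add noise.
- Assuming the output is dense; it is a SciPy sparse matrix, which saves memory when most counts are zero.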
Always check the vocabulary and understand the output format.
```python
from sklearn.feature_extraction.text import CountVectorizer

texts_train = ['I love AI', 'AI is the future']
texts_test = ['I love the future']

vectorizer = CountVectorizer()
vectorizer.fit(texts_train)  # Fit on training data only

# Wrong: fitting again on test data (loses training vocab)
# vectorizer.fit(texts_test)

# Correct: transform test data using trained vectorizer
test_matrix = vectorizer.transform(texts_test)

print('Vocabulary:', vectorizer.get_feature_names_out())
print('Test matrix:\n', test_matrix.toarray())
```
Output
Vocabulary: ['ai' 'future' 'is' 'love' 'the']
Test matrix:
[[0 1 0 1 1]]
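Two of the pitfalls listed in this section can be checked directly in code. The sketch below inspects the sparse output without converting it, then shows what happens when transform is called on a vectorizer that was never fitted. It reuses the training sentences from the example above and relies on scipy.sparse.issparse and sklearn.exceptions.NotFittedError.

```python
from scipy import sparse
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

texts = ['I love AI', 'AI is the future']
matrix = CountVectorizer().fit_transform(texts)

# The result is a SciPy sparse matrix, not a NumPy array
print("Is sparse:", sparse.issparse(matrix))
print("Shape (documents, vocabulary size):", matrix.shape)
print("Stored non-zero entries:", matrix.nnz)

# Convert to a dense array only for small data or inspection
dense = matrix.toarray()

# Transforming with an unfitted vectorizer raises NotFittedError
not_fitted_raised = False
try:
    CountVectorizer().transform(texts)
except NotFittedError:
    not_fitted_raised = True
print("Unfitted transform raised NotFittedError:", not_fitted_raised)
```

Most scikit-learn estimators accept the sparse matrix as is, so the dense conversion is usually needed only for printing.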
Quick Reference
| Parameter | Description | Default |
|---|---|---|
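| stop_words | Stop words to remove: 'english' for the built-in list, or a custom list of words | None |
| lowercase | Convert all text to lowercase before tokenizing | True |
| max_features | Keep only the most frequent terms, up to this vocabulary size | None |
| ngram_range | Range of n-gram sizes to extract (e.g., (1, 2) for unigrams and bigrams) | (1, 1) |
| max_df | Ignore terms that appear in more than this share (float) or number (int) of documents | 1.0 |
| min_df | Ignore terms that appear in fewer than this share (float) or number (int) of documents | 1 |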
Key Takeaways
- CountVectorizer converts text into a matrix of word counts for machine learning.
- Always fit the vectorizer on training data before transforming new data.
- Use parameters like stop_words and lowercase to clean and control features.
- The output is a sparse matrix; convert to array only if needed for inspection.
- Check vocabulary with get_feature_names_out() to understand extracted tokens.
