How to Use CountVectorizer in sklearn with Python
Use CountVectorizer from sklearn.feature_extraction.text to convert text into a matrix of token counts. First, create a CountVectorizer object, then call fit_transform() on your text data to get the numeric features.

Syntax

The basic syntax to use CountVectorizer is:
- CountVectorizer(): Creates the vectorizer object with optional parameters like stop_words or max_features.
- fit_transform(texts): Learns the vocabulary from the list of texts and transforms them into a matrix of token counts.
- get_feature_names_out(): Returns the list of tokens (words) learned.
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(texts)
features = vectorizer.get_feature_names_out()
```
Example
This example shows how to convert a list of text documents into a matrix of word counts using CountVectorizer. It also prints the feature names and the resulting matrix.
```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print('Feature names:', vectorizer.get_feature_names_out())
print('Count matrix:\n', X.toarray())
```
Output
Feature names: ['coding' 'fun' 'in' 'is' 'learning' 'love' 'machine' 'python']
Count matrix:
[[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]

Note that the standalone 'I' does not appear as a feature: the default token pattern only keeps tokens of two or more characters.
Common Pitfalls
Common mistakes when using CountVectorizer include:
- Not converting the sparse matrix to an array before printing, which can be confusing.
- Ignoring stop words that may skew results; use stop_words='english' to remove common words.
- Not fitting the vectorizer before transforming new data, which causes errors.
- Assuming the order of features is the same across different vectorizers or datasets.
```python
from sklearn.feature_extraction.text import CountVectorizer

texts_train = ['I love AI', 'AI is the future']
texts_test = ['I love the future']

vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(texts_train)

# Wrong: transforming test data with an unfitted vectorizer
# X_test = CountVectorizer().transform(texts_test)  # raises NotFittedError

# Right: reuse the same fitted vectorizer
X_test = vectorizer.transform(texts_test)
print(X_test.toarray())
```
Output
[[0 1 1]]

The learned vocabulary is ['ai', 'future', 'love'] ('is' and 'the' are removed as stop words), and the test sentence contains 'future' and 'love' but not 'ai'.
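A related pitfall: transform() silently ignores words that were never seen during fitting, rather than raising an error. A minimal sketch (the fruit sentences are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(['apples and oranges'])

# 'bananas' was never seen during fit, so it is silently dropped;
# columns are ['and', 'apples', 'oranges'] in alphabetical order
X = vectorizer.transform(['apples and bananas'])
print(X.toarray())
```

The output is [[1 1 0]]: 'apples' and 'and' are counted, 'bananas' contributes nothing, and the 'oranges' column stays zero. Keep this in mind when test data uses vocabulary absent from the training set.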
Quick Reference
| Method/Parameter | Description |
|---|---|
| CountVectorizer() | Create vectorizer with optional parameters like stop_words, max_features |
| fit_transform(texts) | Learn vocabulary and transform texts to count matrix |
| transform(new_texts) | Transform new texts using learned vocabulary |
| get_feature_names_out() | Get list of tokens (words) learned |
| stop_words='english' | Remove common English stop words |
| max_features=1000 | Limit vocabulary size to top 1000 tokens |
Key Takeaways
- Create a CountVectorizer object and call fit_transform() on your text data to get token counts.
- Use get_feature_names_out() to see which words correspond to each column in the output matrix.
- Always fit the vectorizer on training data before transforming new data to avoid errors.
- Consider removing common words with stop_words='english' to improve feature quality.
- The output is a sparse matrix; convert it to an array with toarray() for easy viewing.