CountVectorizer vs TfidfVectorizer in Python: Key Differences and Usage
In sklearn, CountVectorizer converts text into a matrix of token counts, while TfidfVectorizer transforms text into a matrix of TF-IDF features that reflect word importance. TfidfVectorizer downweights common words, making it better suited for tasks that need word relevance rather than raw frequency.
Quick Comparison
This table summarizes the main differences between CountVectorizer and TfidfVectorizer in sklearn.
| Feature | CountVectorizer | TfidfVectorizer |
|---|---|---|
| Purpose | Counts word occurrences | Measures word importance with TF-IDF |
| Output | Matrix of raw token counts | Matrix of TF-IDF scores |
| Effect on common words | No downweighting | Downweights frequent words |
| Use case | Simple frequency-based features | Features emphasizing unique words |
| Computational cost | Lower | Slightly higher due to IDF calculation |
| Typical application | Bag-of-words models | Text classification and retrieval |
Key Differences
CountVectorizer creates a matrix where each entry is the count of a word in a document. It treats all words equally, so common words like "the" or "and" have high counts but no special weighting.
TfidfVectorizer builds on this by multiplying counts by the inverse document frequency (IDF). This means words that appear in many documents get lower scores, highlighting words that are more unique and informative for each document.
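As a sanity check, sklearn's default smoothed IDF can be reproduced by hand. With `smooth_idf=True` (the default), the formula is `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is how many documents contain the term. The two-document corpus here mirrors the code examples below:

```python
import math

n = 2  # number of documents in the corpus

# "the" appears in both documents -> df = 2 -> IDF bottoms out at 1.0
idf_the = math.log((1 + n) / (1 + 2)) + 1

# "cat" appears in only one document -> df = 1 -> higher IDF
idf_cat = math.log((1 + n) / (1 + 1)) + 1

print(idf_the)  # 1.0
print(idf_cat)  # ~1.405
```

A word appearing in every document gets the minimum IDF of 1.0, while rarer words are scaled up, which is exactly the downweighting effect described above.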
Because of this, TfidfVectorizer is often better for tasks like text classification or search, where distinguishing important words matters. CountVectorizer is simpler and faster, useful when raw frequency is enough or as a baseline.
Code Comparison
```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat on the mat", "the dog ate my homework"]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(texts)

print("Feature names:", vectorizer.get_feature_names_out())
print("Count matrix:\n", X_counts.toarray())
```
TfidfVectorizer Equivalent
```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the cat sat on the mat", "the dog ate my homework"]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(texts)

print("Feature names:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", X_tfidf.toarray())
```
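To see the downweighting concretely, the two vectorizers can be run side by side on the same corpus. Both use the default tokenizer, so they share a vocabulary, and comparing the relative weight of "the" against "cat" in the first document shows TF-IDF shrinking the advantage of the common word:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the cat sat on the mat", "the dog ate my homework"]

cv = CountVectorizer().fit(texts)
tv = TfidfVectorizer().fit(texts)

counts = cv.transform(texts).toarray()
tfidf = tv.transform(texts).toarray()

# Same default tokenization, so both vectorizers map words to the same columns
vocab = cv.vocabulary_
the, cat = vocab["the"], vocab["cat"]

# In document 0, "the" counts twice as much as "cat"...
print(counts[0, the] / counts[0, cat])  # 2.0

# ...but under TF-IDF its advantage shrinks, since "the" appears in every document
print(tfidf[0, the] / tfidf[0, cat])  # ~1.42
```

The L2 normalization that TfidfVectorizer applies by default cancels out in the ratio, so the remaining difference comes purely from the IDF weights.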
When to Use Which
Choose CountVectorizer when you want a simple count of words and your model or task benefits from raw frequency, such as basic topic modeling or when speed is critical.
Choose TfidfVectorizer when you need to emphasize important words and reduce the impact of common words, which is common in text classification, search engines, and information retrieval.
In general, TfidfVectorizer often leads to better performance in machine learning tasks involving text because it captures word importance beyond just frequency.
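As a minimal sketch of that workflow, TfidfVectorizer can be dropped into a scikit-learn pipeline ahead of a classifier. The corpus and labels below are made up purely for illustration (1 = spam):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus with invented labels (1 = spam, 0 = not spam)
texts = [
    "win a free prize now",
    "limited offer claim your free reward",
    "meeting rescheduled to friday afternoon",
    "please review the attached project report",
]
labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and classifier together,
# so new text is transformed with the same vocabulary and IDF weights
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["claim your free prize"]))
```

Keeping the vectorizer inside the pipeline also prevents a common mistake: fitting the vectorizer on test data, which would leak document frequencies from the evaluation set.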