Machine learning models work on numbers, so text must first be converted into numeric features. CountVectorizer does this by counting how often each word occurs, and TF-IDF weights those counts by how distinctive each word is across the documents.
Text feature basics (CountVectorizer, TF-IDF) in ML Python
Introduction
These techniques are useful in situations such as:
Analyzing customer reviews to find common words.
Building a spam filter to detect unwanted emails.
Summarizing news articles by their important words.
Clustering similar documents based on their text.
Preparing text data for machine learning models.
Syntax
ML Python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Create a CountVectorizer or TfidfVectorizer object
vectorizer = CountVectorizer()  # or TfidfVectorizer()

# Fit and transform text data into numbers
X = vectorizer.fit_transform(texts)

# Get feature names (words)
words = vectorizer.get_feature_names_out()
CountVectorizer counts how often each word appears.
TF-IDF gives more weight to words that are frequent in one document but rare across the collection, and less weight to words that appear in most documents.
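To make the weighting concrete, the sketch below rebuilds TF-IDF by hand from raw counts and compares it with TfidfVectorizer. It assumes scikit-learn's default settings (smoothed IDF and L2 row normalization); the variable names and the two-sentence corpus are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["I love apples", "You love oranges"]  # toy corpus, for illustration only

# Raw term counts (the "TF" part)
counts = CountVectorizer().fit_transform(texts).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed IDF, matching scikit-learn's default (smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Compare against the library result
tfidf_sklearn = TfidfVectorizer().fit_transform(texts).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

If the defaults are changed (for example norm=None or smooth_idf=False), the manual steps would need to change with them.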
Examples
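This counts the words in two sentences and prints the vocabulary and the count matrix. By default, CountVectorizer lowercases the text and ignores single-character tokens, so "I" does not appear in the vocabulary.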
ML Python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love apples", "You love oranges"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())
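Once a vectorizer has been fitted, its vocabulary is fixed. The sketch below, which reuses the two sentences above and adds a made-up third sentence, shows how fit and transform can be called separately and what happens to a word that was not seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love apples", "You love oranges"]

vectorizer = CountVectorizer()
vectorizer.fit(texts)  # learn the vocabulary only, without transforming

# Mapping from each word to its column index in the matrix
print(vectorizer.vocabulary_)

# "bananas" was not seen during fit, so it has no column and is ignored
new_counts = vectorizer.transform(["I love bananas"])
print(new_counts.toarray())
```

In practice this means the vectorizer should be fitted on the training text and then reused, via transform, on any new text.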
This calculates TF-IDF scores for the same sentences. Because "love" appears in both sentences, it scores lower than the words that are unique to each one.
ML Python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love apples", "You love oranges"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())
Sample Model
This program shows how to convert text into numbers using both CountVectorizer and TfidfVectorizer. It prints the words found and the numeric matrix for each method.
ML Python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding"
]

# Using CountVectorizer
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(texts)
count_words = count_vectorizer.get_feature_names_out()

print("CountVectorizer feature names:", count_words)
print("CountVectorizer matrix:\n", count_matrix.toarray())

# Using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
tfidf_words = tfidf_vectorizer.get_feature_names_out()

print("\nTfidfVectorizer feature names:", tfidf_words)
print("TfidfVectorizer matrix:\n", tfidf_matrix.toarray())
Important Notes
CountVectorizer creates simple counts of words, which is easy to understand.
TF-IDF highlights distinctive words by reducing the weight of words that appear in most documents, such as 'is' or 'the' in a large collection.
Both methods return a sparse matrix with one row per document and one column per word, which machine learning models can use directly.
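As a rough sketch of how such a matrix feeds a model, the example below chains TfidfVectorizer with a classifier in a scikit-learn pipeline. The four training sentences and their labels are made up for illustration, and LogisticRegression is only one possible choice of model, not something the examples above prescribe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data for illustration only; 1 = spam, 0 = not spam
train_texts = [
    "win a free prize now",
    "free money click now",
    "meeting moved to monday",
    "lunch with the team tomorrow",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# New text is vectorized with the vocabulary learned during fit
prediction = model.predict(["free prize inside"])
print(prediction)
```

Wrapping both steps in a pipeline keeps the vocabulary learned during training tied to the model, so new text is always transformed the same way. A dataset this small is far too little for a real spam filter.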
Summary
CountVectorizer counts how many times each word appears in text.
TF-IDF scores words by importance, not just frequency.
These tools help turn text into numbers for machine learning.