
Text feature basics (CountVectorizer, TF-IDF) in ML Python

Introduction

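Machine learning models work on numbers, not raw text, so text must first be converted into numeric features. CountVectorizer does this by counting how often each word occurs in each document. TfidfVectorizer (TF-IDF) additionally weights each count by how distinctive the word is across all documents.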

Typical situations where these tools are useful:

When you want to analyze customer reviews to find common words.
When building a spam filter to detect unwanted emails.
When summarizing news articles by important words.
When clustering similar documents based on their text.
When preparing text data for machine learning models.
Syntax
ML Python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

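# Placeholder input: a list of strings, one string per document
texts = ["first example document", "second example document"]

# Create a CountVectorizer or TfidfVectorizer object
vectorizer = CountVectorizer()  # or TfidfVectorizer()

# Learn the vocabulary and convert the texts into a sparse document-term matrix
X = vectorizer.fit_transform(texts)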

# Get feature names (words)
words = vectorizer.get_feature_names_out()

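CountVectorizer counts how often each word appears in each document.

TF-IDF (term frequency-inverse document frequency) gives more weight to words that are frequent in one document but rare in the others, and less weight to words that appear in most documents.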

Examples
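This counts the words in two sentences and prints the vocabulary and the count matrix. With the default settings the text is lowercased and single-character tokens are ignored, so "I" does not appear in the vocabulary.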
ML Python
from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love apples", "You love oranges"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())
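To make the counting step less of a black box, here is a rough pure-Python sketch of what CountVectorizer does with its default settings. The token pattern and the helper names (`tokenize`, `count_row`) are assumptions made for illustration and are not part of scikit-learn's API.

```python
import re
from collections import Counter

texts = ["I love apples", "You love oranges"]

# Assumption: mimic the default behaviour of lowercasing the text and
# keeping only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for text in texts for token in tokenize(text)})

def count_row(text):
    """Return one row of the count matrix: one count per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

count_matrix = [count_row(text) for text in texts]

print(vocabulary)    # ['apples', 'love', 'oranges', 'you']
print(count_matrix)  # [[1, 1, 0, 0], [0, 1, 1, 1]]
```

Each row corresponds to one sentence and each column to one vocabulary word, which is the same layout that `X.toarray()` prints.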
This calculates TF-IDF scores for the same sentences. Because "love" occurs in both sentences, it receives a lower weight than the words that are unique to one sentence.
ML Python
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["I love apples", "You love oranges"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())
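The scores can also be worked out by hand. The sketch below assumes TfidfVectorizer's default settings: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. The count matrix is written out directly, and `tfidf_row` is a hypothetical helper name.

```python
import math

# Count matrix for ["I love apples", "You love oranges"].
vocabulary = ["apples", "love", "oranges", "you"]
counts = [
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]

n_docs = len(counts)

# Document frequency: in how many documents does each word occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency (assumed default weighting).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight the counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm if norm else 0.0 for value in weighted]

tfidf = [tfidf_row(row) for row in counts]

for row in tfidf:
    print([round(value, 3) for value in row])
```

In the first row, "apples" ends up with a higher score than "love" because "love" appears in both sentences and therefore has a smaller idf.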
Sample Program

This program shows how to convert text into numbers using both CountVectorizer and TfidfVectorizer. It prints the words found and the numeric matrix for each method.

ML Python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding"
]

# Using CountVectorizer
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(texts)
count_words = count_vectorizer.get_feature_names_out()

print("CountVectorizer feature names:", count_words)
print("CountVectorizer matrix:\n", count_matrix.toarray())

# Using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
tfidf_words = tfidf_vectorizer.get_feature_names_out()

print("\nTfidfVectorizer feature names:", tfidf_words)
print("TfidfVectorizer matrix:\n", tfidf_matrix.toarray())
Important Notes

CountVectorizer produces raw word counts, which are easy to interpret.

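TF-IDF highlights distinctive words by reducing the weight of words that appear in many documents, such as 'is' or 'the' in a large collection. In a very small corpus, a word like 'is' may occur in only one document, in which case it is not down-weighted.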

Both methods convert text into a sparse matrix, with one row per document and one column per vocabulary word, that machine learning models can use.

Summary

CountVectorizer counts how many times each word appears in each document.

TF-IDF scores words by how distinctive they are for a document, not just by raw frequency.

These tools help turn text into numbers for machine learning.