Text feature basics (CountVectorizer, TF-IDF) in ML Python

Introduction

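Machine learning models work on numbers, so text has to be converted into numeric features first. CountVectorizer does this by counting how often each word appears in each document, and TF-IDF (via TfidfVectorizer) does it by scoring how important each word is to a document.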

Typical situations where these tools are useful:

Analyzing customer reviews to find common words.

Building a spam filter to detect unwanted emails.

Summarizing news articles by their most important words.

Clustering similar documents based on their text.

Preparing text data for machine learning models.

Syntax

ML Python

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example input: a list of strings, one per document
texts = ["I love apples", "You love oranges"]

# Create a CountVectorizer or TfidfVectorizer object
vectorizer = CountVectorizer()  # or TfidfVectorizer()

# Fit the vocabulary and transform the text into a sparse numeric matrix
X = vectorizer.fit_transform(texts)

# Get feature names (words), one per matrix column
words = vectorizer.get_feature_names_out()

CountVectorizer counts how often each word appears.

TF-IDF scales each count by how rare the word is across all documents, so words that appear in nearly every document get less weight than distinctive ones.
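To make the two ideas concrete, here is a minimal hand-written sketch in plain Python. It assumes the weighting that scikit-learn applies by default (raw counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row). The helper names `tokenize`, `count_matrix`, and `tfidf_matrix` are made up for this illustration and are not part of any library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # which mirrors the default tokenization in scikit-learn.
    return re.findall(r"\b\w\w+\b", text.lower())


def count_matrix(texts):
    # One Counter per document, then one column per vocabulary word.
    docs = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*docs))
    return vocab, [[doc[word] for word in vocab] for doc in docs]


def tfidf_matrix(texts):
    vocab, counts = count_matrix(texts)
    n_docs = len(texts)
    # Document frequency: in how many documents does each word occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed inverse document frequency (assumed default formula).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        rows.append([v / norm for v in weighted])  # L2-normalize the row
    return vocab, rows


texts = ["I love apples", "You love oranges"]
vocab, counts = count_matrix(texts)
_, tfidf = tfidf_matrix(texts)
print(vocab)
print(counts)
print(tfidf)
```

In this sketch "love" occurs in both sentences, so its weight ends up lower than the weight of "apples" in the first row, even though both words have a count of 1.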

Examples

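This counts the words in two sentences and prints the vocabulary and the count matrix. With the default settings the text is lowercased and one-character tokens such as "I" are dropped, so "I" does not appear in the vocabulary.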

ML Python

from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love apples", "You love oranges"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())

This calculates TF-IDF scores for the same sentences. Because "love" appears in both sentences, it scores lower than the words that are unique to each sentence.

ML Python

from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["I love apples", "You love oranges"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())

Sample Model

This program shows how to convert text into numbers using both CountVectorizer and TfidfVectorizer. It prints the words found and the numeric matrix for each method.

ML Python

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding"
]

# Using CountVectorizer
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(texts)
count_words = count_vectorizer.get_feature_names_out()

print("CountVectorizer feature names:", count_words)
print("CountVectorizer matrix:\n", count_matrix.toarray())

# Using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
tfidf_words = tfidf_vectorizer.get_feature_names_out()

print("\nTfidfVectorizer feature names:", tfidf_words)
print("TfidfVectorizer matrix:\n", tfidf_matrix.toarray())

Important Notes

CountVectorizer creates simple counts of words, which is easy to understand.

TF-IDF highlights distinctive words by lowering the weight of words that appear in many documents, which in a large corpus typically includes common words like 'is' or 'the'.

Both methods convert text into a matrix that machine learning models can use.
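As a rough illustration of that last point, the sketch below feeds TF-IDF features into a classifier through a scikit-learn pipeline. The choice of `MultinomialNB` and the tiny spam/ham dataset are assumptions made up for this example; a real spam filter would need far more data and proper evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for this illustration only.
train_texts = [
    "win a free prize now",
    "free money win now",
    "meeting at noon tomorrow",
    "lunch meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the text first, then passes the matrix to the model.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = list(model.predict(["free prize", "meeting tomorrow"]))
print(predictions)
```

Wrapping both steps in one pipeline means new text is vectorized with the vocabulary learned during training before it reaches the classifier.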

Summary

CountVectorizer counts how many times each word appears in text.

TF-IDF scores words by importance, not just frequency.

These tools help turn text into numbers for machine learning.

Practice

(1/5)

1. What does CountVectorizer do in text processing?

easy

A. Calculates the importance of words based on frequency and rarity

B. Counts how many times each word appears in the text

C. Removes stop words from the text

D. Converts text into lowercase only

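Solution

Step 1: Understand CountVectorizer's role. It builds a vocabulary from the text and counts how many times each word appears in each document.

Step 2: Differentiate from TF-IDF. Scoring words by frequency and rarity (option A) describes TF-IDF, not CountVectorizer.

Final Answer: B. Counts how many times each word appears in the text

Quick Check: CountVectorizer = word counts.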
