We turn words into numbers so computers can understand text. CountVectorizer and TF-IDF help us do this by counting words or measuring their importance.
Text feature basics (CountVectorizer, TF-IDF) in ML Python
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # Example input: a list of strings, one per document
    texts = ["I love apples", "You love oranges"]

    # Create a CountVectorizer or TfidfVectorizer object
    vectorizer = CountVectorizer()  # or TfidfVectorizer()

    # Fit and transform text data into numbers
    X = vectorizer.fit_transform(texts)

    # Get feature names (words)
    words = vectorizer.get_feature_names_out()
CountVectorizer counts how often each word appears.
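TF-IDF scales each word's count by how rare the word is across all documents, so words that appear in most documents get less weight than words that are specific to a few.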
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["I love apples", "You love oranges"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    print(vectorizer.get_feature_names_out())
    print(X.toarray())

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["I love apples", "You love oranges"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    print(vectorizer.get_feature_names_out())
    print(X.toarray())
This program shows how to convert text into numbers using both CountVectorizer and TfidfVectorizer. It prints the words found and the numeric matrix for each method.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    texts = [
        "I love machine learning",
        "Machine learning is fun",
        "I love coding"
    ]

    # Using CountVectorizer
    count_vectorizer = CountVectorizer()
    count_matrix = count_vectorizer.fit_transform(texts)
    count_words = count_vectorizer.get_feature_names_out()
    print("CountVectorizer feature names:", count_words)
    print("CountVectorizer matrix:\n", count_matrix.toarray())

    # Using TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
    tfidf_words = tfidf_vectorizer.get_feature_names_out()
    print("\nTfidfVectorizer feature names:", tfidf_words)
    print("TfidfVectorizer matrix:\n", tfidf_matrix.toarray())
CountVectorizer creates simple counts of words, which is easy to understand.
TF-IDF helps highlight important words by reducing the weight of common words like 'is' or 'the'.
Both methods convert text into a matrix that machine learning models can use.
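To make the link between the two methods concrete, here is a minimal sketch that rebuilds TF-IDF values from raw counts and compares them with what TfidfVectorizer produces. It assumes scikit-learn's default settings (smoothed IDF and L2 row normalization); the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]

# Raw term counts: one row per document, one column per word.
counts = CountVectorizer().fit_transform(texts).toarray()

n_docs = counts.shape[0]
# Document frequency: the number of documents each word appears in.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumes the default smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit length (assumes norm="l2").
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library's own result.
library_tfidf = TfidfVectorizer().fit_transform(texts).toarray()
matches = np.allclose(manual_tfidf, library_tfidf)
print("Manual TF-IDF matches TfidfVectorizer:", matches)
```

If non-default options such as sublinear_tf=True or norm=None are used, the manual steps would need to change accordingly.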
CountVectorizer counts how many times each word appears in text.
TF-IDF scores words by importance, not just frequency.
These tools help turn text into numbers for machine learning.
Practice
What does CountVectorizer do in text processing?
Solution
Step 1: Understand CountVectorizer's role
CountVectorizer transforms text into a matrix of token counts, counting word occurrences.
Step 2: Differentiate from TF-IDF
Unlike TF-IDF, it does not weight words by importance; it only counts frequency.
Final Answer:
Counts how many times each word appears in the text -> Option B
Quick Check:
CountVectorizer = word counts [OK]
- Confusing CountVectorizer with TF-IDF
- Thinking it removes stop words by default
- Assuming it normalizes text only
Solution
Step 1: Recall correct sklearn import path
CountVectorizer is in the sklearn.feature_extraction.text module.
Step 2: Check syntax correctness
The following uses the correct import and instantiation syntax:
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
Final Answer:
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
-> Option A
Quick Check:
Correct import path and syntax [OK]
- Using wrong module path for import
- Incorrect import syntax (like import ... from ...)
- Forgetting to instantiate the class
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["I love cats", "Cats love me"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)
    print(X.shape)
Solution
Step 1: Count unique words in sentences
After lowercasing, the tokens kept are 'cats', 'love', 'me'. The default token_pattern only keeps tokens of two or more characters, so 'i' is dropped -> 3 unique words.
Step 2: Understand shape of output matrix
There are 2 sentences (rows) and 3 unique words (columns), so shape is (2, 3).
Final Answer:
(2, 3)
Quick Check:
Rows = sentences, columns = unique words of at least two characters [OK]
- Mixing rows and columns in shape
- Counting duplicate words multiple times
- Ignoring case sensitivity (CountVectorizer lowercases by default)
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["apple banana apple", "banana fruit"]
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(texts)
    print(tfidf.get_feature_names())
Solution
Step 1: Check method usage for feature names
get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions this call raises an AttributeError.
Step 2: Use updated method
Use get_feature_names_out() instead to get the feature names without error.
Final Answer:
get_feature_names() is deprecated, should use get_feature_names_out() -> Option C
Quick Check:
Use get_feature_names_out() for TF-IDF features [OK]
- Using deprecated get_feature_names() method
- Passing wrong data type to fit_transform
- Incorrect import paths
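For code that may need to run on both old and new scikit-learn releases, one possible approach is to check which method exists before calling it. This is only a sketch of that idea; pinning a recent scikit-learn version and calling get_feature_names_out() directly is usually simpler.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["apple banana apple", "banana fruit"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# Prefer the current method; fall back only on releases that lack it.
if hasattr(tfidf, "get_feature_names_out"):
    feature_names = list(tfidf.get_feature_names_out())
else:
    feature_names = list(tfidf.get_feature_names())

print(feature_names)
```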
Solution
Step 1: Understand the goal of reducing common word impact
Common words appear frequently but carry less meaning, so their impact should be lowered.
Step 2: Identify method that weights words by importance
TF-IDF scores a word higher when it is frequent in a document but rare across the whole collection, which reduces the impact of words that appear everywhere.
Final Answer:
TF-IDF Vectorizer -> Option D
Quick Check:
TF-IDF = importance weighting [OK]
- Using raw counts which treat all words equally
- Assuming stop words removal alone solves importance
- Confusing one-hot encoding with frequency weighting
