Practice

1. What does CountVectorizer do in text processing?

easy

A. Calculates the importance of words based on frequency and rarity

B. Counts how many times each word appears in the text

C. Removes stop words from the text

D. Converts text into lowercase only

Solution

Step 1: Understand CountVectorizer's role
CountVectorizer transforms text into a matrix of token counts, counting word occurrences.
Step 2: Differentiate from TF-IDF
Unlike TF-IDF, it does not weigh words by importance, only counts frequency.
Final Answer:
Counts how many times each word appears in the text -> Option B
Quick Check:
CountVectorizer = word counts [OK]

Hint: CountVectorizer counts words, TF-IDF scores importance [OK]

Common Mistakes:

Confusing CountVectorizer with TF-IDF
Thinking it removes stop words by default
Assuming it only normalizes text (for example, lowercasing) without counting anything
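A minimal sketch of the counting behaviour described in question 1. The two documents are made up for illustration; the point is that 'the' is counted like any other word and is not removed by default.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (not from the quiz)
docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()           # default settings, no stop-word removal
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

vocab = vectorizer.vocabulary_           # maps each term to its column index
terms = sorted(vocab)
dense = counts.toarray()

# Raw occurrences of 'the' in each document
the_col = vocab["the"]
the_counts = [int(dense[0, the_col]), int(dense[1, the_col])]

print(terms)
print(the_counts)
```

Each row is one document and each column is one vocabulary term, so the entry for 'the' is simply the number of times it occurs in that document.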

2. Which of the following is the correct way to import and create a CountVectorizer in Python?

easy

A. from sklearn.feature_extraction.text import CountVectorizer; vectorizer = CountVectorizer()

B. import CountVectorizer from sklearn.text; vectorizer = CountVectorizer()

C. from sklearn.text import CountVectorizer; vectorizer = CountVectorizer()

D. import CountVectorizer; vectorizer = CountVectorizer()

Solution

Step 1: Recall correct sklearn import path
CountVectorizer is in sklearn.feature_extraction.text module.
Step 2: Check syntax correctness
Option A uses the correct "from ... import ..." syntax and then instantiates the class with CountVectorizer().
Final Answer:
from sklearn.feature_extraction.text import CountVectorizer; vectorizer = CountVectorizer() -> Option A
Quick Check:
Correct import path and syntax [OK]

Hint: CountVectorizer is in sklearn.feature_extraction.text [OK]

Common Mistakes:

Using wrong module path for import
Incorrect import syntax (like import ... from ...)
Forgetting to instantiate the class
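One way to confirm the import path from question 2 is to ask the instantiated object where its class lives, and to try the wrong path from options B and C. This is only a sketch; the variable names are chosen for illustration.

```python
import importlib

from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the class; without the parentheses you would only hold the class itself
vectorizer = CountVectorizer()

# The module that defines the class
module_path = type(vectorizer).__module__
print(module_path)

# 'sklearn.text' is not a real module, so importing it fails
try:
    importlib.import_module("sklearn.text")
    wrong_path_fails = False
except ImportError:
    wrong_path_fails = True

print(wrong_path_fails)
```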

3. What will be the output shape of the matrix after applying CountVectorizer on these two sentences?

from sklearn.feature_extraction.text import CountVectorizer
sentences = ["I love cats", "Cats love me"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
print(X.shape)

medium

A. (2, 4)

B. (2, 3)

C. (3, 2)

D. (4, 2)

Solution

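Step 1: Count the tokens CountVectorizer keeps
The text is lowercased, and the default token_pattern keeps only tokens of two or more characters, so 'I' is dropped. The vocabulary is 'cats', 'love', 'me' -> 3 unique tokens.
Step 2: Understand shape of output matrix
There are 2 sentences (rows) and 3 vocabulary terms (columns), so shape is (2, 3).
Final Answer:
(2, 3) -> Option B
Quick Check:
Rows = sentences, columns = vocabulary terms [OK]

Hint: Shape = (number of texts, vocabulary size); single-letter words are dropped by default [OK]

Common Mistakes:

Mixing rows and columns in shape
Counting duplicate words multiple times
Ignoring case sensitivity (CountVectorizer lowercases by default)
Counting single-letter words such as 'I', which the default tokenizer discards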
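The shape in question 3 can be checked directly. The sketch below runs the quiz sentences through the default CountVectorizer, whose token_pattern only keeps tokens of two or more characters, and then through a second vectorizer with a looser pattern. The second vectorizer and its variable names are additions for comparison, not part of the quiz.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love cats", "Cats love me"]

# Default settings: lowercases text and keeps tokens with 2+ word characters
default_vec = CountVectorizer()
X_default = default_vec.fit_transform(sentences)
default_terms = sorted(default_vec.vocabulary_)

# Comparison: a pattern that also keeps single-character tokens such as 'i'
keep_short_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X_short = keep_short_vec.fit_transform(sentences)
short_terms = sorted(keep_short_vec.vocabulary_)

print(default_terms, X_default.shape)
print(short_terms, X_short.shape)
```

With the defaults the single-letter word 'I' never reaches the vocabulary, which is why the matrix has three columns rather than four.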

4. Identify the error in this TF-IDF code snippet:

from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["apple banana apple", "banana fruit"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
print(tfidf.get_feature_names())

medium

A. fit_transform() should be called on texts as a string, not list

B. TfidfVectorizer() requires stop_words parameter

C. get_feature_names() is deprecated, should use get_feature_names_out()

D. Import statement is incorrect

Solution

Step 1: Check method usage for feature names
get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions this call raises an AttributeError.
Step 2: Use updated method
Call get_feature_names_out() instead; it returns the feature names as a NumPy array.
Final Answer:
get_feature_names() is deprecated, should use get_feature_names_out() -> Option C
Quick Check:
Use get_feature_names_out() for TF-IDF features [OK]

Hint: Use get_feature_names_out() with TF-IDF [OK]

Common Mistakes:

Using deprecated get_feature_names() method
Passing wrong data type to fit_transform
Incorrect import paths

5. You want to transform text data so that common words like 'the' and 'is' have less impact, but rare important words have higher scores. Which method should you use?

hard

A. One-hot encoding of words

B. CountVectorizer without stop words

C. Raw word counts from CountVectorizer

D. TF-IDF Vectorizer

Solution

Step 1: Understand the goal of reducing common word impact
Common words appear frequently but carry less meaning, so their impact should be lowered.
Step 2: Identify method that weighs words by importance
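TF-IDF multiplies a term's frequency within a document by its inverse document frequency, so words that appear in most documents (like 'the' and 'is') get low weights, while words concentrated in a few documents get high weights.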
Final Answer:
TF-IDF Vectorizer -> Option D
Quick Check:
TF-IDF = importance weighting [OK]

Hint: Use TF-IDF to weigh rare words higher [OK]

Common Mistakes:

Using raw counts which treat all words equally
Assuming stop words removal alone solves importance
Confusing one-hot encoding with frequency weighting
