Recall & Review
beginner
What is the Bag of Words model in text processing?
It is a way to represent text by counting how many times each word appears, ignoring grammar and word order.
beginner
What does CountVectorizer do in NLP?
CountVectorizer converts a collection of text documents into a matrix of word counts, showing how often each word appears in each document.
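A minimal sketch of this conversion, assuming scikit-learn is installed. The two sample documents are made up for illustration; rows of the matrix are documents and columns are vocabulary words in alphabetical order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents, chosen only for illustration.
docs = ["the cat sat", "the cat saw the dog"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = sorted(vectorizer.vocabulary_)   # column order is alphabetical
matrix = counts.toarray().tolist()

print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

The 2 in the second row is the word "the", which appears twice in the second document.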
beginner
Why does Bag of Words ignore word order?
Because it focuses only on the frequency of words, not their position, to simplify text into numbers for machine learning.
intermediate
How does CountVectorizer handle new words not seen during training?
It ignores words not in its vocabulary, so new words in test data are not counted in the output matrix.
intermediate
What is a limitation of the Bag of Words model?
It loses the meaning from word order and context, so it can’t understand phrases or sentence structure.
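To make this limitation concrete, here is a small pure-Python sketch. The helper bag_of_words is a hypothetical illustration (lowercase, split on whitespace, count), not a library function. Two sentences with opposite meanings end up with the same bag.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# The counts are identical even though the sentences mean different things.
same_bag = (a == b)
print(same_bag)  # True
```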
What does CountVectorizer output for a set of text documents?
A. A matrix of word counts per document
B. A list of sentences
C. A summary of text meaning
D. A list of synonyms
CountVectorizer creates a matrix showing how many times each word appears in each document.
Which aspect does Bag of Words ignore?
A. Word frequency
B. Word spelling
C. Word order
D. Word count
Bag of Words counts words but does not consider the order they appear in.
If a new word appears in test data but not in training, what happens in CountVectorizer?
A. It ignores the word
B. It counts the word normally
C. It adds the word to the vocabulary
D. It throws an error
CountVectorizer ignores words not in its learned vocabulary.
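A short sketch of this behavior, assuming scikit-learn is installed. The training and test texts are invented for illustration: the vocabulary is fixed when fit() runs, and transform() only counts words that are already in it.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["red apple", "green apple"]  # hypothetical training texts
test_docs = ["red banana"]                 # "banana" never appears in training

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)                 # the vocabulary is learned here

vocab = sorted(vectorizer.vocabulary_)
test_row = vectorizer.transform(test_docs).toarray().tolist()[0]

print(vocab)     # ['apple', 'green', 'red']
print(test_row)  # [0, 0, 1]
```

Only "red" is counted in the test row. "banana" has no column, so it contributes nothing and no error is raised.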
Why is Bag of Words useful for machine learning?
A. It translates text into another language
B. It converts text into numbers that models can understand
C. It summarizes text meaning
D. It corrects grammar mistakes
Machine learning models need numbers, and Bag of Words turns text into numeric counts.
Which is a common problem with Bag of Words?
A. It requires labeled data
B. It needs a lot of memory for small texts
C. It only works with numbers
D. It loses context and word order
Bag of Words does not keep the order or meaning of words, only counts.
Explain how CountVectorizer transforms text data into a format usable by machine learning models.
Think about how text is turned into numbers by counting words.
Describe one main limitation of the Bag of Words model and why it matters.
Consider what information is lost when only counting words.
Practice
1. What does the Bag of Words model do in text processing?
easy
A. Counts how often each word appears in the text
B. Translates text into another language
C. Removes all punctuation from the text
D. Generates summaries of the text
Solution
Step 1: Understand Bag of Words purpose
Bag of Words counts the frequency of each word in a text, ignoring order.
Step 2: Compare options to definition
Only option A, "Counts how often each word appears in the text", matches this description.
Final Answer:
Counts how often each word appears in the text -> Option A
Quick Check:
Bag of Words = Counts words [OK]
Hint: Bag of Words counts words; it does not translate or summarize [OK]
Common Mistakes:
Confusing Bag of Words with translation
Thinking it only removes punctuation
Assuming it summarizes text
2. Which of the following is the correct way to import CountVectorizer from scikit-learn in Python?
easy
A. import CountVectorizer from sklearn.feature_extraction
B. from sklearn.feature_extraction.text import CountVectorizer
C. from sklearn.text import CountVectorizer
D. import CountVectorizer from sklearn.text
Solution
Step 1: Recall correct import path
CountVectorizer lives in the sklearn.feature_extraction.text module.
Step 2: Match options to correct syntax
Only option B, from sklearn.feature_extraction.text import CountVectorizer, uses both the correct 'from ... import ...' syntax and the correct module path.
Final Answer:
from sklearn.feature_extraction.text import CountVectorizer -> Option B
Quick Check:
Correct import path = from sklearn.feature_extraction.text import CountVectorizer [OK]
Hint: CountVectorizer is in sklearn.feature_extraction.text [OK]
Common Mistakes:
Using wrong module path
Incorrect import syntax
Trying to import from sklearn.text
3. What will be the output shape of the matrix after applying CountVectorizer on these two sentences: ['I love cats', 'Cats love me']?
medium
A. (3, 2)
B. (2, 3)
C. (4, 2)
D. (2, 4)
Solution
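Step 1: Identify unique words
After lowercasing, the words are 'i', 'love', 'cats', 'me' ('Cats' and 'cats' count as the same word). With default settings, CountVectorizer only keeps tokens of two or more characters, so 'i' is dropped, leaving 'cats', 'love', 'me'.
Step 2: Count sentences and features
There are 2 sentences and 3 vocabulary words, so the matrix shape is (2, 3).
Final Answer:
(2, 3) -> Option B
Quick Check:
2 sentences, 3 words = (2, 3) [OK]
Hint: Count sentences and unique words of two or more characters for shape (rows, columns) [OK]
Common Mistakes:
Counting words per sentence instead of unique words
Mixing rows and columns in shape
Counting 'I', which the default tokenizer drops
Forgetting that text is lowercased by default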
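The shape for this question can be checked directly. The sketch below assumes scikit-learn is installed. It runs the default CountVectorizer, whose token pattern only matches words of two or more characters, and then a second vectorizer with a custom token_pattern (an assumption added here for comparison) that also keeps one-character words.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cats", "Cats love me"]

# Default settings: lowercase the text and drop one-character tokens such as "I".
default_vec = CountVectorizer()
default_shape = default_vec.fit_transform(texts).shape
default_vocab = sorted(default_vec.vocabulary_)

# Comparison only: a token pattern that also accepts one-character words.
keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_short_shape = keep_short.fit_transform(texts).shape
keep_short_vocab = sorted(keep_short.vocabulary_)

print(default_shape, default_vocab)        # (2, 3) ['cats', 'love', 'me']
print(keep_short_shape, keep_short_vocab)  # (2, 4) ['cats', 'i', 'love', 'me']
```

With default settings the matrix has 3 columns. It only has 4 when the token pattern is changed to keep "I".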
4. On scikit-learn 1.2 or later, the following code throws an error. What is the mistake?
from sklearn.feature_extraction.text import CountVectorizer
texts = ['hello world', 'hello']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
print(vectorizer.get_feature_names())
medium
A. get_feature_names() is deprecated, should use get_feature_names_out()
B. fit_transform() should be fit_transform_text()
C. toarray() is not a method of X
D. CountVectorizer() needs a parameter for language
Solution
Step 1: Identify the removed method
get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions the call raises an AttributeError.
Step 2: Use correct method
Replace get_feature_names() with get_feature_names_out() to fix the error.
Final Answer:
get_feature_names() is deprecated, should use get_feature_names_out() -> Option A
Quick Check:
Use get_feature_names_out() not get_feature_names() [OK]
Hint: Use get_feature_names_out() instead of deprecated get_feature_names() [OK]
Common Mistakes:
Thinking fit_transform() is wrong
Assuming toarray() is invalid
Believing CountVectorizer needs language parameter
5. You have a list of sentences with some words repeated many times. How can you use CountVectorizer to ignore words that appear in more than 50% of the sentences?
hard
A. Set min_df=0.5 to ignore frequent words
B. Use stop_words='english' to remove frequent words
C. Set the parameter max_df=0.5 when creating CountVectorizer
D. Set max_features=0.5 to limit word count
Solution
Step 1: Understand max_df parameter
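max_df=0.5 tells CountVectorizer to ignore words that appear in more than 50% of documents. The comparison is strict: a word in exactly 50% of documents is kept.
Step 2: Compare other options
min_df=0.5 does the opposite: it drops words that appear in fewer than 50% of documents. stop_words='english' removes a fixed list of common English words, regardless of how often they occur in your data. max_features expects an integer number of features, not a fraction. None of these filters frequent words by document percentage.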
Final Answer:
Set the parameter max_df=0.5 when creating CountVectorizer -> Option C
Quick Check:
max_df filters frequent words by document frequency [OK]
Hint: Use max_df to exclude very common words [OK]
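A small sketch of max_df in action, assuming scikit-learn is installed. The four sentences are invented for illustration: "the" appears in 3 of 4 documents (75%) and is dropped, while "sleeps" appears in exactly 2 of 4 (50%) and is kept because the threshold is strict.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences chosen so that document frequencies are easy to count.
docs = [
    "the cat sleeps",
    "the dog sleeps",
    "the bird sings",
    "one fish swims",
]

vectorizer = CountVectorizer(max_df=0.5)  # drop words found in more than 50% of documents
vectorizer.fit(docs)

kept = sorted(vectorizer.vocabulary_)
print(kept)  # ['bird', 'cat', 'dog', 'fish', 'one', 'sings', 'sleeps', 'swims']
```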