Challenge - 5 Problems
Bag of Words Master
❓ Predict Output
intermediate
Output of CountVectorizer on simple text
What is the output of the following code snippet using CountVectorizer from scikit-learn?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['apple banana apple', 'banana orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
result = X.toarray()
vocab = vectorizer.get_feature_names_out()
print(vocab)
print(result)
💡 Hint
CountVectorizer sorts vocabulary alphabetically and counts word occurrences per document.
CountVectorizer builds a vocabulary sorted alphabetically: ['apple', 'banana', 'orange']. Each row of the result counts how many times each vocabulary word appears in one document. The first document has 'apple' twice, 'banana' once, and no 'orange', so its row is [2, 1, 0]. The second document has 'banana' once and 'orange' once, so its row is [0, 1, 1].
🧠 Conceptual
intermediate
Understanding vocabulary size in CountVectorizer
Given the corpus: ['cat dog', 'dog mouse', 'cat mouse dog'], what is the vocabulary size created by CountVectorizer with default settings?
💡 Hint
CountVectorizer creates one vocabulary entry per unique token across all documents. With default settings, tokens are lowercased and must contain at least two word characters.
The unique words are 'cat', 'dog', and 'mouse'. So the vocabulary size is 3.
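One way to confirm the vocabulary size is to inspect the fitted vectorizer's vocabulary_ attribute, which maps each token to its column index. This is a small sketch using default settings; the variable name vocab_size is only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['cat dog', 'dog mouse', 'cat mouse dog']
vectorizer = CountVectorizer().fit(corpus)

# vocabulary_ is a dict of token -> column index, so its length is the vocabulary size
vocab_size = len(vectorizer.vocabulary_)

print(sorted(vectorizer.vocabulary_))  # ['cat', 'dog', 'mouse']
print(vocab_size)                      # 3
```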
❓ Hyperparameter
advanced
Effect of stop_words parameter in CountVectorizer
What will be the vocabulary output of CountVectorizer when applied to ['the cat sat', 'the dog barked'] with stop_words='english'?
💡 Hint
Setting stop_words='english' removes common English words such as 'the' before the vocabulary is built.
Stop words such as 'the' are removed. The remaining words are 'cat', 'sat', 'dog', and 'barked'. The vocabulary is sorted alphabetically, giving ['barked', 'cat', 'dog', 'sat'].
❓ Metrics
advanced
Calculating document frequency with CountVectorizer
Using CountVectorizer on ['apple apple banana', 'banana orange', 'apple orange orange'], what is the document frequency (number of documents containing the word) for 'apple'?
💡 Hint
Document frequency counts in how many documents the word appears at least once.
'apple' appears in the first and third documents, so document frequency is 2.
🔧 Debug
expert
Identifying error in CountVectorizer usage
What error will the following code raise?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['hello world', 123, 'hello']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
💡 Hint
CountVectorizer expects all documents to be strings.
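The corpus contains the integer 123, but CountVectorizer expects every document to be a string. With default settings (lowercase=True), preprocessing calls .lower() on each document, so the integer raises an AttributeError: 'int' object has no attribute 'lower'. With lowercase=False, the failure would typically surface instead as a TypeError from the regular-expression tokenizer.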