
Bag of Words (CountVectorizer) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of CountVectorizer on simple text
What is the output of the following code snippet using CountVectorizer from scikit-learn?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['apple banana apple', 'banana orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
result = X.toarray()
vocab = vectorizer.get_feature_names_out()
print(vocab)
print(result)
A. ['apple' 'orange' 'banana']\n[[2 0 1]\n [0 1 1]]
B. ['banana' 'apple' 'orange']\n[[1 2 0]\n [1 0 1]]
C. ['apple' 'banana' 'orange']\n[[2 1 0]\n [0 1 1]]
D. ['apple' 'banana' 'orange']\n[[1 2 0]\n [0 1 1]]
💡 Hint
CountVectorizer sorts vocabulary alphabetically and counts word occurrences per document.
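The two steps in the hint (sort the vocabulary, then count per document) can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's implementation: the helper names `tokenize` and `bag_of_words` are hypothetical, the corpus is made up and deliberately differs from the question, and the regular expression mirrors CountVectorizer's default token pattern (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

# Mirrors CountVectorizer's defaults: lowercase, tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Split one document into lowercase tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

def bag_of_words(corpus):
    """Return (alphabetically sorted vocabulary, list of count rows)."""
    tokenized = [tokenize(doc) for doc in corpus]
    # One column per unique token, in sorted order.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Hypothetical corpus, not the one from the question.
vocab, matrix = bag_of_words(["red blue red red", "green blue"])
print(vocab)   # ['blue', 'green', 'red']
print(matrix)  # [[1, 0, 3], [1, 1, 0]]
```

Each column of the matrix lines up with the vocabulary entry at the same position, so reading an output means matching counts to the sorted terms, not to the order in which words first appear.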
🧠 Conceptual
intermediate
Understanding vocabulary size in CountVectorizer
Given the corpus: ['cat dog', 'dog mouse', 'cat mouse dog'], what is the vocabulary size created by CountVectorizer with default settings?
A. 3
B. 5
C. 4
D. 2
💡 Hint
CountVectorizer creates one vocabulary word per unique token across all documents.
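As a rough sketch of the hint, vocabulary size is just the number of distinct tokens across the whole corpus. The function name `vocabulary_size` and the corpora below are hypothetical; the token pattern is assumed to match CountVectorizer's default, which also drops single-character tokens.

```python
import re

def vocabulary_size(corpus):
    """Count distinct lowercase tokens of 2+ word characters across all documents."""
    tokens = set()
    for doc in corpus:
        tokens.update(re.findall(r"(?u)\b\w\w+\b", doc.lower()))
    return len(tokens)

# Hypothetical corpora, not the one from the question.
size = vocabulary_size(["sun moon", "moon star sun", "star"])
print(size)  # 3

# Single-character tokens such as "I" and "a" are dropped by the pattern.
short_size = vocabulary_size(["I saw a star"])
print(short_size)  # 2
```

With a fitted scikit-learn vectorizer, the same number should be available as `len(vectorizer.vocabulary_)`.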
Hyperparameter
advanced
Effect of stop_words parameter in CountVectorizer
What vocabulary does CountVectorizer produce when fitted on ['the cat sat', 'the dog barked'] with stop_words='english'?
A. ['the', 'cat', 'sat', 'dog', 'barked']
B. ['barked', 'cat', 'dog', 'sat']
C. ['cat', 'dog']
D. ['the']
💡 Hint
Setting stop_words='english' removes common English words such as 'the' before the vocabulary is built.
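Stop-word filtering can be pictured as dropping tokens before they ever reach the vocabulary. The sketch below is a simplified stand-in: `STOP_WORDS` is a tiny illustrative set (an assumption; scikit-learn's built-in English list is far longer), and the `vocabulary` helper and corpus are hypothetical.

```python
import re

# Tiny illustrative stop list (an assumption); the real English list is much longer.
STOP_WORDS = frozenset({"the", "a", "an", "is", "on"})

def vocabulary(corpus, stop_words=frozenset()):
    """Return the sorted vocabulary, skipping any token found in stop_words."""
    tokens = set()
    for doc in corpus:
        for tok in re.findall(r"(?u)\b\w\w+\b", doc.lower()):
            if tok not in stop_words:
                tokens.add(tok)
    return sorted(tokens)

# Hypothetical corpus, not the one from the question.
corpus = ["the bird sang", "the fish swam"]
without_filter = vocabulary(corpus)
with_filter = vocabulary(corpus, STOP_WORDS)
print(without_filter)  # ['bird', 'fish', 'sang', 'swam', 'the']
print(with_filter)     # ['bird', 'fish', 'sang', 'swam']
```

Only words in the stop list disappear; every other token stays, and the result is still sorted alphabetically.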
Metrics
advanced
Calculating document frequency with CountVectorizer
Using CountVectorizer on ['apple apple banana', 'banana orange', 'apple orange orange'], what is the document frequency (number of documents containing the word) for 'apple'?
A. 2
B. 3
C. 1
D. 0
💡 Hint
Document frequency counts in how many documents the word appears at least once.
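Document frequency is easy to confuse with total term count, so here is a minimal sketch of the difference. The `document_frequency` helper and the corpus are hypothetical, and tokenization is assumed to follow CountVectorizer's default pattern.

```python
import re
from collections import Counter

def document_frequency(corpus):
    """Map each term to the number of documents it appears in at least once."""
    df = Counter()
    for doc in corpus:
        # set() ensures a term repeated within one document is counted once.
        df.update(set(re.findall(r"(?u)\b\w\w+\b", doc.lower())))
    return dict(df)

# Hypothetical corpus, not the one from the question.
df = document_frequency(["tea tea milk", "milk sugar", "milk milk"])
print(df["tea"], df["milk"], df["sugar"])  # 1 3 1
```

Here "tea" occurs twice in total but in only one document, and "milk" occurs four times but in three documents. On a count matrix, the equivalent operation is counting the nonzero entries in each column.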
🔧 Debug
expert
Identifying error in CountVectorizer usage
What error will the following code raise?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['hello world', 123, 'hello']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
A. AttributeError: 'int' object has no attribute 'lower'
B. ValueError: empty vocabulary; perhaps the documents only contain stop words
C. No error, code runs successfully
D. TypeError: expected string or bytes-like object
💡 Hint
CountVectorizer expects all documents to be strings.
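Because every document must be a string, one defensive habit is to scan the corpus before vectorizing. The helper below is a hypothetical sketch (the name `find_non_string_documents` and the sample corpus are not from scikit-learn or the question); it only reports offending entries and leaves the decision to drop or convert them to the caller.

```python
def find_non_string_documents(corpus):
    """Return (index, type name) for every entry that is not a str."""
    return [
        (i, type(doc).__name__)
        for i, doc in enumerate(corpus)
        if not isinstance(doc, str)
    ]

# Hypothetical corpus with two bad entries.
corpus = ["good morning", 3.14, None, "good night"]
problems = find_non_string_documents(corpus)
print(problems)  # [(1, 'float'), (2, 'NoneType')]

# Keep only genuine strings before handing the corpus to a vectorizer.
clean_corpus = [doc for doc in corpus if isinstance(doc, str)]
print(clean_corpus)  # ['good morning', 'good night']
```

Whether to drop bad entries or convert them with `str()` depends on the data; dropping keeps placeholders such as `None` from turning into spurious tokens.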