ML Python · ~20 mins

Text feature basics (CountVectorizer, TF-IDF) in ML Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of CountVectorizer on simple text
What is the output of the following code snippet using CountVectorizer?
ML Python
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['apple orange apple', 'orange banana orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
result = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print(feature_names)
print(result)
A. ['apple' 'banana' 'orange']\n[[2 0 1]\n [0 1 2]]
B. ['apple' 'banana' 'orange']\n[[1 0 2]\n [0 2 1]]
C. ['apple' 'banana' 'orange']\n[[2 1 0]\n [1 0 2]]
D. ['banana' 'apple' 'orange']\n[[2 0 1]\n [0 1 2]]
💡 Hint
CountVectorizer sorts features alphabetically and counts word occurrences per document.
🧠 Conceptual
intermediate
Understanding TF-IDF importance
Which statement best describes why TF-IDF is useful compared to simple word counts?
A. TF-IDF counts the total number of words in a document without weighting.
B. TF-IDF reduces the weight of common words and highlights rare but important words.
C. TF-IDF only counts words that appear in all documents equally.
D. TF-IDF ignores word frequency and only uses document length.
💡 Hint
Think about how common words like 'the' or 'and' should be treated in text analysis.
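The downweighting of common words can be seen directly in the learned IDF values; a small sketch with a made-up corpus (the sentences here are illustrative, not from the problem):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 'the' appears in every document; 'galaxy' appears in only one.
corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'a telescope views the galaxy',
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

vocab = vectorizer.vocabulary_  # word -> column index
idf = vectorizer.idf_           # learned inverse document frequencies

# A word present in all documents gets the minimum IDF, so common
# words contribute less to the final TF-IDF weight than rare ones.
print('the   :', idf[vocab['the']])
print('galaxy:', idf[vocab['galaxy']])
```

Because TF-IDF multiplies the in-document frequency by this IDF factor, ubiquitous words like 'the' are suppressed while distinctive words like 'galaxy' stand out, which is exactly what option B describes.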
Metrics
advanced
Comparing vector lengths from CountVectorizer and TF-IDF
Given the same text corpus, which statement about the vector lengths produced by CountVectorizer and TfidfVectorizer is true?
A. Vectors from TfidfVectorizer usually have smaller values but the same length as CountVectorizer vectors.
B. Vectors from CountVectorizer always have larger length because they count words multiple times.
C. Vectors from TfidfVectorizer are always longer because they add extra features.
D. Vectors from CountVectorizer and TfidfVectorizer have different lengths because they use different vocabularies.
💡 Hint
Both vectorizers use the same vocabulary by default.
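A quick shape comparison makes the point concrete; a minimal sketch reusing the earlier two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['apple orange apple', 'orange banana orange']
count_X = CountVectorizer().fit_transform(corpus)
tfidf_X = TfidfVectorizer().fit_transform(corpus)

# Same default tokenizer, same vocabulary -> same matrix shape;
# only the cell values differ (raw counts vs. normalized weights).
print(count_X.shape, tfidf_X.shape)  # both (2, 3)
```

Vector length is determined by vocabulary size, not by the weighting scheme, so with default settings the two matrices always share a shape.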
🔧 Debug
advanced
Identifying error in TF-IDF code snippet
What error will this code raise and why?
ML Python
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['cat dog', 'dog mouse']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
A. ValueError because 'english' stop words remove all words, leaving an empty vocabulary
B. AttributeError because get_feature_names_out() does not exist
C. TypeError because stop_words must be a list, not a string
D. No error; output is ['cat' 'dog' 'mouse']
💡 Hint
Check what words remain after removing English stop words from the corpus.
Model Choice
expert
Choosing the best vectorizer for short text classification
You want to classify very short text messages (like tweets) where common words appear frequently but are not useful. Which vectorizer choice is best and why?
A. Use CountVectorizer with max_features=10 to limit vocabulary size.
B. Use CountVectorizer without stop words because raw counts capture all info.
C. Use TfidfVectorizer with English stop words to reduce common word impact and highlight rare words.
D. Use TfidfVectorizer without stop words to keep all words weighted equally.
💡 Hint
Think about how to reduce noise from common words in short texts.