ML Python · ~20 mins

Bag of Words and TF-IDF in ML Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual · intermediate
Understanding TF-IDF Importance

Which statement best explains why TF-IDF is preferred over simple Bag of Words for text classification?

A. TF-IDF reduces the impact of common words by weighting them lower, highlighting important words unique to documents.
B. TF-IDF converts text into images for better visual classification.
C. TF-IDF removes all stop words from the text before analysis.
D. TF-IDF counts the total number of words in a document to improve classification accuracy.
💡 Hint

Think about how common words like 'the' or 'and' affect simple word counts.

Predict Output · intermediate
Output of Bag of Words Vectorization

What is the output of the following Python code using CountVectorizer from sklearn?

ML Python
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['apple banana apple', 'banana orange', 'apple orange orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
A.
[[2 0 1]
 [1 1 0]
 [0 2 1]]
B.
[[1 2 0]
 [0 1 1]
 [2 0 1]]
C.
[[1 1 1]
 [1 1 1]
 [1 1 1]]
D.
[[2 1 0]
 [0 1 1]
 [1 0 2]]
💡 Hint

CountVectorizer orders words alphabetically and counts their occurrences per document.

Model Choice · advanced
Choosing Vectorization for Sparse Data

You have a large collection of short text messages with many unique words appearing rarely. Which vectorization method is best to reduce noise and improve model performance?

A. Use TF-IDF to weight words by importance across messages.
B. Use Bag of Words with raw counts only.
C. Use one-hot encoding for each word.
D. Use word embeddings without any weighting.
💡 Hint

Consider how to reduce the effect of rare or common words in sparse text data.

Metrics · advanced
Evaluating Text Vectorization Impact

After applying Bag of Words and TF-IDF vectorization separately on the same dataset, you train a classifier. Which metric difference best indicates TF-IDF improved the model?

A. Lower accuracy but higher training time with TF-IDF.
B. Higher accuracy and lower false positive rate with TF-IDF.
C. Same accuracy but higher false negative rate with TF-IDF.
D. Higher accuracy but higher false positive rate with Bag of Words.
💡 Hint

Better vectorization should improve correct predictions and reduce mistakes.
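If you want to compute the two quantities this question compares, here is a sketch using illustrative labels (the `y_true`/`y_pred` values are made up, not from any real classifier run):

```python
# Sketch: accuracy and false positive rate from a confusion matrix.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
fpr = fp / (fp + tn)  # fraction of true negatives wrongly flagged positive

print(accuracy)  # 0.75
print(fpr)       # 0.25
```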

🔧 Debug · expert
Debugging TF-IDF Vectorizer Output

Given this code snippet, what error or issue will occur?

ML Python
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['cat dog', 'dog mouse', 'cat mouse mouse']
vectorizer = TfidfVectorizer(stop_words=['dog'])
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
A. Output will include 'dog' because the stop_words parameter is ignored.
B. SyntaxError due to incorrect stop_words parameter type.
C. Output will be ['cat', 'mouse'] because 'dog' is removed as a stop word.
D. ValueError because stop_words must be a string, not a list.
💡 Hint

Check how TfidfVectorizer accepts stop words as input.