Which statement best explains why TF-IDF is preferred over simple Bag of Words for text classification?
Think about how common words like 'the' or 'and' affect simple word counts.
TF-IDF lowers the weight of very common words across all documents, so unique and important words stand out more for classification.
What is the output of the following Python code using CountVectorizer from sklearn?
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['apple banana apple', 'banana orange', 'apple orange orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
CountVectorizer orders words alphabetically and counts their occurrences per document.
The vocabulary is ordered alphabetically as ['apple', 'banana', 'orange'], so the per-document counts are [2, 1, 0], [0, 1, 1], and [1, 0, 2], matching the output in option D.
You have a large collection of short text messages with many unique words appearing rarely. Which vectorization method is best to reduce noise and improve model performance?
Consider how to reduce the effect of rare or common words in sparse text data.
TF-IDF weights words by their importance, reducing noise from rare or overly common words, which helps models learn better from sparse data.
After applying Bag of Words and TF-IDF vectorization separately on the same dataset, you train a classifier. Which metric difference best indicates TF-IDF improved the model?
Better vectorization should improve correct predictions and reduce mistakes.
Improved vectorization like TF-IDF usually leads to higher accuracy and fewer false positives, showing better model performance.
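A hedged sketch of how that comparison might be measured; the labels and predictions below are hypothetical, chosen only to illustrate computing accuracy and false positives with sklearn.metrics (they are not results from a real experiment):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true       = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred_bow   = [0, 1, 1, 0, 1, 1, 0, 1]  # hypothetical Bag-of-Words predictions
y_pred_tfidf = [0, 0, 1, 1, 1, 1, 0, 1]  # hypothetical TF-IDF predictions

for name, y_pred in [('BoW', y_pred_bow), ('TF-IDF', y_pred_tfidf)]:
    # confusion_matrix for binary labels ravels to (tn, fp, fn, tp).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f'{name}: accuracy={accuracy_score(y_true, y_pred):.3f}, false positives={fp}')
```

Higher accuracy together with fewer false positives is the pattern the explanation describes.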
Given this code snippet, what error or issue will occur?
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['cat dog', 'dog mouse', 'cat mouse mouse']
vectorizer = TfidfVectorizer(stop_words=['dog'])
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
Check how TfidfVectorizer accepts stop words as input.
No error occurs: TfidfVectorizer accepts a custom list of stop words and removes them from the vocabulary, so 'dog' is excluded from the feature names.