Which statement best explains why TF-IDF is preferred over simple Bag of Words for text classification?
Think about how common words like 'the' or 'and' affect simple word counts.
TF-IDF lowers the weight of very common words across all documents, so unique and important words stand out more for classification.
What is the output of the following Python code using CountVectorizer from sklearn?
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['apple banana apple', 'banana orange', 'apple orange orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
CountVectorizer orders words alphabetically and counts their occurrences per document.
The vocabulary is ordered alphabetically as ['apple', 'banana', 'orange'], so the per-document counts are [2, 1, 0], [0, 1, 1], and [1, 0, 2], matching the output in option D.
You have a large collection of short text messages with many unique words appearing rarely. Which vectorization method is best to reduce noise and improve model performance?
Consider how to reduce the effect of rare or common words in sparse text data.
TF-IDF weights words by their importance, reducing noise from rare or overly common words, which helps models learn better from sparse data.
After applying Bag of Words and TF-IDF vectorization separately on the same dataset, you train a classifier. Which metric difference best indicates TF-IDF improved the model?
Better vectorization should improve correct predictions and reduce mistakes.
Improved vectorization like TF-IDF usually leads to higher accuracy and fewer false positives, showing better model performance.
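A hedged sketch of how that comparison might be measured; the labels and predictions below are hypothetical, chosen only to illustrate computing accuracy and false positives with sklearn.metrics (they are not results from a real experiment):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true       = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred_bow   = [0, 1, 1, 0, 1, 1, 0, 1]  # hypothetical Bag-of-Words predictions
y_pred_tfidf = [0, 0, 1, 1, 1, 1, 0, 1]  # hypothetical TF-IDF predictions

for name, y_pred in [('BoW', y_pred_bow), ('TF-IDF', y_pred_tfidf)]:
    # confusion_matrix for binary labels ravels to (tn, fp, fn, tp).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f'{name}: accuracy={accuracy_score(y_true, y_pred):.3f}, false positives={fp}')
```

Higher accuracy together with fewer false positives is the pattern the explanation describes.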
Given this code snippet, what error or issue will occur?
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['cat dog', 'dog mouse', 'cat mouse mouse']
vectorizer = TfidfVectorizer(stop_words=['dog'])
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
Check how TfidfVectorizer accepts stop words as input.
No error occurs: TfidfVectorizer accepts a custom list of stop words and removes them from the vocabulary, so 'dog' is excluded from the feature names.