from sklearn.feature_extraction.text import CountVectorizer

corpus = ['apple orange apple', 'orange banana orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
result = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print(feature_names)
print(result)
CountVectorizer extracts the unique words and sorts them alphabetically: 'apple', 'banana', 'orange'. It then counts how many times each word appears in each document: the first document has 2 'apple' and 1 'orange', giving the row [2, 0, 1]; the second has 1 'banana' and 2 'orange', giving [0, 1, 2].
TF-IDF stands for Term Frequency-Inverse Document Frequency. It lowers the importance of words that appear in many documents (common words) and raises the importance of words that are rare but may carry more meaning.
Both CountVectorizer and TfidfVectorizer create vectors with the same number of features (one per vocabulary word). TF-IDF rescales the raw counts into weights between 0 and 1 (each row is L2-normalized by default), so the values are usually smaller, but the vector length stays the same.
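A quick sketch of this claim, using a hypothetical two-document corpus: both vectorizers yield matrices of identical shape, and the TF-IDF weights never exceed 1 because of the default L2 row normalization.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus for comparing the two vectorizers.
corpus = ['apple orange apple', 'orange banana orange']

count_X = CountVectorizer().fit_transform(corpus)
tfidf_X = TfidfVectorizer().fit_transform(corpus)

# Same documents, same vocabulary -> same matrix shape.
print(count_X.shape == tfidf_X.shape)  # True

# TF-IDF rows are L2-normalized by default, so every weight is <= 1.
print(tfidf_X.toarray().max() <= 1.0)  # True
```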
The corpus words 'cat', 'dog', and 'mouse' are not in sklearn's built-in English stop-word list, so they all survive filtering and the vocabulary is non-empty. A ValueError ("empty vocabulary") is raised only when every word in the corpus is a stop word; that is not the case here, so no error occurs and option D is correct.
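This can be checked directly: a corpus of non-stop words vectorizes fine, while a corpus built only from stop words (here a made-up example, 'the and of') triggers the ValueError.

```python
from sklearn.feature_extraction.text import CountVectorizer

# 'cat', 'dog', 'mouse' are not in sklearn's English stop list,
# so a vocabulary survives and fit_transform succeeds.
ok = CountVectorizer(stop_words='english').fit_transform(['cat dog mouse'])
print(ok.shape)  # one document, three surviving features

# A corpus made entirely of stop words leaves nothing to index,
# so fitting raises ValueError ("empty vocabulary ...").
try:
    CountVectorizer(stop_words='english').fit_transform(['the and of'])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```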
Short texts often consist largely of common words that carry little signal for classification. TfidfVectorizer with stop-word removal drops those common words and weights rare terms higher, helping the model focus on meaningful features.