2. Which parameter in scikit-learn's CountVectorizer sets an upper limit on the vocabulary size?

easy

A. max_features

B. min_df

C. stop_words

D. ngram_range

3. What will be the output vocabulary size after running this code?

from sklearn.feature_extraction.text import CountVectorizer
texts = ['apple banana apple', 'banana orange', 'apple orange orange']
vectorizer = CountVectorizer(max_features=2)
vectorizer.fit(texts)
vocab = vectorizer.get_feature_names_out()
print(len(vocab))

medium

A. 3

B. 2

C. 4

D. 1

4. Identify the error in this code snippet that tries to limit vocabulary size:

from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat dog', 'dog mouse', 'cat mouse']
vectorizer = CountVectorizer(max_features='3')
vectorizer.fit(texts)
vocab = vectorizer.get_feature_names_out()
print(vocab)

medium

A. max_features should be an integer, not a string

B. fit() should be replaced with fit_transform()

C. get_feature_names_out() is deprecated

D. texts should be a numpy array

5. You want to build a text classifier but your dataset has 100,000 unique words. To speed up training and reduce noise, which approach best controls vocabulary size?

hard

A. Increase max_features to 200,000 to include more words

B. Use all 100,000 words to keep maximum information

C. Remove stop words only without limiting vocabulary size

D. Set max_features to a smaller number like 5000 in your vectorizer

Solution

Step 1: Understand vocabulary size control

Step 2: Identify the main goal

Final Answer:

Quick Check:

Solution

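Step 1: Recall CountVectorizer parameters

max_features keeps at most that many terms; min_df drops terms that appear in too few documents; stop_words removes the listed words; ngram_range sets the n-gram lengths to extract.

Step 2: Identify parameter for vocabulary size

Only max_features places a hard upper limit on the number of terms kept.

Final Answer: A. max_features

Quick Check: Fitting with max_features=N yields a vocabulary of at most N terms.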

Solution

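Step 1: Understand max_features effect

max_features=2 keeps only the two terms with the highest total count across the corpus.

Step 2: Count unique words and frequencies

The corpus has three unique words: apple (3 occurrences), banana (2), and orange (3). The two most frequent, apple and orange, are kept, so len(vocab) is 2.

Final Answer: B. 2

Quick Check: The vocabulary size cannot exceed max_features, and the corpus has 3 unique words, so the result is 2.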
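
The frequency count behind question 3 can be reproduced with the standard library alone. This is an illustration of the ranking idea (keep the terms with the highest total count across the corpus), not scikit-learn's own implementation; splitting on whitespace is a simplifying assumption that happens to match this corpus.

```python
from collections import Counter

texts = ['apple banana apple', 'banana orange', 'apple orange orange']

# Total occurrences of each word across the whole corpus.
term_counts = Counter(word for text in texts for word in text.split())

# Keep the two most frequent terms, mirroring max_features=2.
top_terms = sorted(word for word, _ in term_counts.most_common(2))

print(dict(term_counts))
print(top_terms, len(top_terms))
```

banana has the lowest count, so it is the term dropped when the limit is 2.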

Solution

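Step 1: Check max_features type

max_features must be a positive integer or None. The string '3' is rejected with an error rather than converted to 3.

Step 2: Confirm other parts are correct

fit() is enough to learn the vocabulary, get_feature_names_out() is the current method (it replaced the deprecated get_feature_names()), and a list of strings is valid input.

Final Answer: A. max_features should be an integer, not a string

Quick Check: Changing max_features='3' to max_features=3 makes the code run.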

Solution

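Step 1: Understand problem with large vocabulary

Each unique word becomes a feature, so 100,000 words means 100,000 columns. Many of them are rare words that add noise and slow down training.

Step 2: Choose best vocabulary control method

Raising the limit to 200,000 or keeping every word does not shrink the feature space, and removing stop words alone trims only a small number of terms. Setting max_features to a smaller number such as 5000 keeps the most frequent terms and cuts the feature count directly.

Final Answer: D. Set max_features to a smaller number like 5000 in your vectorizer

Quick Check: A smaller max_features gives fewer features, which means faster training and less noise.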
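
To make the trade-off in question 5 concrete, the sketch below builds a synthetic corpus and compares the width of the feature matrix with and without a cap. The corpus, its 200-word vocabulary, and the cap of 50 are made-up stand-ins for the 100,000 words and the limit of 5000 in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic corpus: 200 distinct tokens (w0 .. w199). Token wN appears in
# every document whose index is a multiple of N + 1, so low-numbered
# tokens are far more frequent than high-numbered ones.
n_words, n_docs = 200, 300
docs = [
    " ".join(f"w{n}" for n in range(n_words) if d % (n + 1) == 0)
    for d in range(n_docs)
]

# No cap: one column per unique token.
full = CountVectorizer().fit_transform(docs)

# Capped: only the 50 most frequent tokens are kept.
capped_vectorizer = CountVectorizer(max_features=50)
capped = capped_vectorizer.fit_transform(docs)

print(full.shape, capped.shape)
```

The number of rows is unchanged; only the number of columns a downstream classifier has to process shrinks. The right cap for a real dataset is something to tune, and 5000 is simply the value the question offers.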