
Vocabulary size control in NLP

Introduction
Controlling vocabulary size helps models focus on informative words and run faster by ignoring rare or unimportant ones. It is useful when:
Building a text classifier and you want to reduce noise from rare words.
Training a language model and you need to limit memory use.
Preparing text data for chatbots and you want to keep the model simple.
Working with limited computing power and you want faster training.
Improving model generalization by ignoring very rare words.
Syntax
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=VOCAB_SIZE)
X = vectorizer.fit_transform(texts)
max_features sets the maximum number of words to keep, ranked by how often they occur across the corpus.
Only the VOCAB_SIZE most frequent words are kept in the vocabulary; everything else is discarded.
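A quick way to see this in action is to fit a vectorizer on a tiny corpus and inspect what survives. The corpus below is made up for illustration, with distinct token frequencies so the top-2 selection is unambiguous:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "data" appears 3 times, "model" 2 times,
# "pipeline" once, so the top-2 cutoff is unambiguous.
texts = [
    'data model data',
    'data model pipeline',
]

vectorizer = CountVectorizer(max_features=2)
X = vectorizer.fit_transform(texts)

print(sorted(vectorizer.vocabulary_))  # ['data', 'model']
print(X.shape)                         # (2, 2)
```

"pipeline" is dropped entirely, so the feature matrix has one column per kept word, not per word in the corpus.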
Examples
Keeps only the 1000 most frequent words from the text.
vectorizer = CountVectorizer(max_features=1000)
Keeps up to 500 words that appear in at least 5 documents.
vectorizer = CountVectorizer(max_features=500, min_df=5)
Keeps top 2000 words excluding common English stop words.
vectorizer = CountVectorizer(max_features=2000, stop_words='english')
Sample Model
This example limits the vocabulary to the top 5 words by frequency from the sample texts. It then shows the vocabulary and the transformed feature matrix.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python',
    'Python coding is great for machine learning'
]

VOCAB_SIZE = 5
vectorizer = CountVectorizer(max_features=VOCAB_SIZE)
X = vectorizer.fit_transform(texts)

print('Vocabulary:', vectorizer.get_feature_names_out())
print('Transformed shape:', X.shape)
print('Feature matrix (dense):\n', X.toarray())
Important Notes
Choosing a very small vocabulary size may lose important words and reduce model accuracy.
Using max_features keeps the most frequent words, which usually carry more meaning.
You can combine vocabulary size control with stop word removal for cleaner data.
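One consequence worth keeping in mind: once the vocabulary is fixed, any word outside it is silently ignored when new text is transformed. A minimal sketch with an invented corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# "spam" occurs 3 times and "ham" twice, so with max_features=2
# those are the two words kept; "eggs" is discarded.
train = ['spam spam ham', 'spam ham eggs']
vec = CountVectorizer(max_features=2)
vec.fit(train)

# "toast" was never in the fitted vocabulary, so it is dropped
# without warning at transform time.
row = vec.transform(['spam toast toast']).toarray()
print(sorted(vec.vocabulary_))  # ['ham', 'spam']
print(row)                      # [[0 1]]
```

This is usually the desired behavior, but it means a too-small vocabulary can make the model blind to words that matter at prediction time.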
Summary
Vocabulary size control limits the number of words the model uses.
It helps speed up training and reduce noise from rare words.
Use max_features in vectorizers to set vocabulary size easily.