Vocabulary size control in NLP - ML Experiment: Train & Evaluate

Experiment - Vocabulary size control
Problem: You are training a text classification model using a neural network. The vocabulary size used to convert words into numbers is very large, causing the model to be slow and overfit the training data.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Training loss: 0.15, Validation loss: 0.65
Issue: The model overfits because the vocabulary size is too large, leading to poor validation accuracy and slow training.
Your Task
Reduce overfitting by controlling the vocabulary size so that validation accuracy improves to at least 80% while keeping training accuracy below 90%.
You can only change the vocabulary size and related preprocessing steps.
Do not change the model architecture or training epochs.
Solution
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

# Sample data
texts = [
    'I love machine learning',
    'Deep learning is fun',
    'Natural language processing with neural networks',
    'Machine learning models can overfit',
    'Vocabulary size affects model performance',
    'Control vocabulary size to reduce overfitting',
    'Neural networks learn from data',
    'Text classification with neural networks',
    'Overfitting happens when model is too complex',
    'Validation accuracy is important'
]
# Labels as a NumPy array: validation_split in fit() needs arrays or tensors, not a Python list
labels = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

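# Cap the vocabulary. With num_words=20 the Tokenizer emits only indices 1-19:
# index 1 is the <OOV> token, so the 18 most frequent words keep their own index.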
vocab_size = 20

tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, padding='post')

# Build model
model = Sequential([
    Embedding(vocab_size, 16),  # input_length omitted: recent Keras versions deprecate it
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
history = model.fit(padded, labels, epochs=30, validation_split=0.2, verbose=0)

# Print final metrics
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
print(f'Training loss: {train_loss:.4f}')
print(f'Validation loss: {val_loss:.4f}')
Capped the vocabulary at the most frequent words instead of keeping every unique word.
Used Tokenizer with num_words=20 and an <OOV> token, so words outside the cap map to a single shared index.
Kept model architecture and training epochs unchanged.
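The effect of the cap can be sketched without TensorFlow. The snippet below is a rough pure-Python approximation of frequency-based capping with an out-of-vocabulary token; build_vocab and encode are hypothetical helper names, not part of Keras, and the three sample sentences are made up for illustration.

```python
from collections import Counter

def build_vocab(texts, num_words, oov_token="<OOV>"):
    """Give ids to the most frequent words; 0 is left for padding, 1 is OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {oov_token: 1}
    # Ids 2 .. num_words-1 go to the most frequent words, so every id is < num_words.
    for word, _ in counts.most_common(max(num_words - 2, 0)):
        vocab[word] = len(vocab) + 1
    return vocab

def encode(text, vocab, oov_token="<OOV>"):
    """Replace each word by its id, falling back to the OOV id."""
    return [vocab.get(word, vocab[oov_token]) for word in text.lower().split()]

sample_texts = ["neural networks learn", "neural networks overfit", "rare word"]
vocab = build_vocab(sample_texts, num_words=4)
encoded = encode("neural networks overfit", vocab)
print(vocab)
print(encoded)
```

With num_words=4 only the two most frequent words get their own id; every other word, including "overfit", collapses onto the OOV id.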
Results Interpretation

Before: Training accuracy 95%, Validation accuracy 70%, Training loss 0.15, Validation loss 0.65

After: Training accuracy 88%, Validation accuracy 82%, Training loss 0.28, Validation loss 0.40

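Reducing vocabulary size reduces overfitting by shrinking the input space: rare words collapse into a shared out-of-vocabulary token, so the model has fewer word-specific parameters with which to memorize the training set. This leads to better validation accuracy and more generalizable models. The 10-sentence sample in the solution code is only a runnable illustration: with validation_split=0.2 it holds out just two sentences, so it will not reproduce these figures.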
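One way to see why a smaller vocabulary simplifies the model is to count embedding weights, which equal vocabulary size times embedding dimension. The sketch below uses the 16-dimensional embedding from the solution; the 100,000-word "before" vocabulary is an assumed figure chosen only for illustration.

```python
def embedding_params(vocab_size, embedding_dim):
    """Number of trainable weights in an embedding table."""
    return vocab_size * embedding_dim

large = embedding_params(100_000, 16)  # assumed uncapped vocabulary
small = embedding_params(20, 16)       # capped vocabulary from the solution
print(large, small)
```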
Bonus Experiment
Try using subword tokenization (like Byte Pair Encoding) to control vocabulary size and compare results.
💡 Hint
Use libraries like SentencePiece or Hugging Face Tokenizers to create subword vocabularies that balance vocabulary size and coverage.
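To give a feel for what subword tokenization does before reaching for those libraries, here is a toy sketch of learning Byte Pair Encoding merges. It is not the SentencePiece or Hugging Face implementation; learn_bpe is a hypothetical helper and the word counts are invented.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges merges from a {word: count} mapping."""
    # Start with each word split into characters plus an end-of-word marker.
    corpus = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        merged_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_corpus[key] = merged_corpus.get(key, 0) + count
        corpus = merged_corpus
    return merges, corpus

word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_counts, num_merges=3)
symbols = {s for word in corpus for s in word}
print(merges)
print(sorted(symbols))
```

The number of merges acts as the vocabulary-size knob: each merge adds one symbol, and frequent endings such as "est" become single units while rare words stay representable as smaller pieces.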

Practice

1. What is the main purpose of controlling vocabulary size in NLP models?
easy
A. To add more rare words to the dataset
B. To increase the number of training epochs
C. To limit the number of words the model uses
D. To make the model ignore stop words

Solution

  1. Step 1: Understand vocabulary size control

    Vocabulary size control means setting a limit on how many unique words the model can use.
  2. Step 2: Identify the main goal

    The goal is to reduce complexity and noise by ignoring very rare words, so the model focuses on common words.
  3. Final Answer:

    To limit the number of words the model uses -> Option C
  4. Quick Check:

    Vocabulary size control = limit words [OK]
Hint: Vocabulary size control means limiting words used [OK]
Common Mistakes:
  • Thinking it increases training epochs
  • Believing it adds rare words
  • Confusing it with stop word removal
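The idea that a few common words carry most of the text can be checked by measuring token coverage. The following is a minimal sketch on made-up sentences; token_coverage is a hypothetical helper, and real corpora will give different numbers.

```python
from collections import Counter

def token_coverage(texts, k):
    """Fraction of all tokens that belong to the k most frequent words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    kept = sum(count for _, count in counts.most_common(k))
    return kept / total

texts = ["the model reads the text", "the model learns", "a quokka appears"]
coverage = token_coverage(texts, k=2)
print(f"{coverage:.2f}")
```

Here 2 of the 8 distinct words account for 5 of the 11 tokens, which is the kind of imbalance that makes a vocabulary cap cheap in terms of lost information.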
2. Which parameter in scikit-learn's CountVectorizer controls the vocabulary size?
easy
A. max_features
B. min_df
C. stop_words
D. ngram_range

Solution

  1. Step 1: Recall CountVectorizer parameters

    CountVectorizer has parameters like max_features, min_df, stop_words, and ngram_range.
  2. Step 2: Identify parameter for vocabulary size

    max_features keeps only the top N terms ranked by total frequency across the corpus, so it caps the vocabulary size directly.
  3. Final Answer:

    max_features -> Option A
  4. Quick Check:

    max_features controls vocabulary size [OK]
Hint: max_features sets max vocabulary size in vectorizers [OK]
Common Mistakes:
  • Choosing min_df which filters by document frequency
  • Confusing stop_words with vocabulary size
  • Thinking ngram_range controls vocabulary size
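The difference between a hard cap and a document-frequency filter can be sketched in plain Python. The two functions below only imitate the ideas behind max_features and min_df; they are hypothetical helpers, not scikit-learn's internals, and the documents are invented.

```python
from collections import Counter

def cap_by_frequency(docs, max_features):
    """Keep the max_features terms with the highest total count."""
    totals = Counter(word for doc in docs for word in doc.split())
    return sorted(word for word, _ in totals.most_common(max_features))

def filter_by_doc_frequency(docs, min_df):
    """Keep every term that appears in at least min_df documents."""
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    return sorted(word for word, df in doc_freq.items() if df >= min_df)

docs = ["cat dog", "dog mouse", "cat mouse", "dog bird"]
capped = cap_by_frequency(docs, max_features=1)
filtered = filter_by_doc_frequency(docs, min_df=2)
print(capped)
print(filtered)
```

The cap always returns at most the requested number of terms, whereas the size of the filtered vocabulary depends on the data.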
3. What will be the output vocabulary size after running this code?
from sklearn.feature_extraction.text import CountVectorizer
texts = ['apple banana apple', 'banana orange', 'apple orange orange']
vectorizer = CountVectorizer(max_features=2)
vectorizer.fit(texts)
vocab = vectorizer.get_feature_names_out()
print(len(vocab))
medium
A. 3
B. 2
C. 4
D. 1

Solution

  1. Step 1: Understand max_features effect

    max_features=2 means the vectorizer keeps only the top 2 most frequent words.
  2. Step 2: Count unique words and frequencies

    Words: apple(3), banana(2), orange(3). Top 2 are apple and orange.
  3. Final Answer:

    2 -> Option B
  4. Quick Check:

    max_features=2 means vocabulary size = 2 [OK]
Hint: max_features limits vocabulary count to given number [OK]
Common Mistakes:
  • Counting all unique words ignoring max_features
  • Assuming max_features is minimum count
  • Confusing frequency with vocabulary size
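The frequency count in Step 2 can be verified with the standard library alone. This sketch splits on whitespace, which is an assumption that matches CountVectorizer's default tokenization only for simple lowercase words like these.

```python
from collections import Counter

texts = ['apple banana apple', 'banana orange', 'apple orange orange']
totals = Counter(word for text in texts for word in text.split())
top_two = sorted(word for word, _ in totals.most_common(2))
print(dict(totals))
print(top_two)
```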
4. Identify the error in this code snippet that tries to limit vocabulary size:
from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat dog', 'dog mouse', 'cat mouse']
vectorizer = CountVectorizer(max_features='3')
vectorizer.fit(texts)
vocab = vectorizer.get_feature_names_out()
print(vocab)
medium
A. max_features should be an integer, not a string
B. fit() should be replaced with fit_transform()
C. get_feature_names_out() is deprecated
D. texts should be a numpy array

Solution

  1. Step 1: Check max_features type

    max_features expects a positive integer (or None), but '3' is a string, so scikit-learn raises an error instead of building the vocabulary.
  2. Step 2: Confirm other parts are correct

    fit() works fine, get_feature_names_out() is the current method, and texts can be a plain list.
  3. Final Answer:

    max_features should be an integer, not a string -> Option A
  4. Quick Check:

    max_features type must be int [OK]
Hint: max_features must be int, not string [OK]
Common Mistakes:
  • Using string instead of integer for max_features
  • Thinking fit_transform is required here
  • Believing get_feature_names_out is deprecated
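This mistake often appears when the limit is read from a config file or command line, where every value arrives as a string. A minimal sketch of converting such a value before passing it on follows; parse_max_features is a hypothetical helper, not a scikit-learn function.

```python
def parse_max_features(raw):
    """Convert a raw setting such as '3' to a positive int, or None if unset."""
    if raw is None:
        return None
    value = int(raw)  # raises ValueError for non-numeric input such as 'three'
    if value <= 0:
        raise ValueError(f"max_features must be positive, got {value}")
    return value

max_features = parse_max_features('3')
print(type(max_features).__name__, max_features)
```

The returned integer can then be passed as max_features=max_features.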
5. You want to build a text classifier but your dataset has 100,000 unique words. To speed up training and reduce noise, which approach best controls vocabulary size?
hard
A. Increase max_features to 200,000 to include more words
B. Use all 100,000 words to keep maximum information
C. Remove stop words only without limiting vocabulary size
D. Set max_features to a smaller number like 5000 in your vectorizer

Solution

  1. Step 1: Understand problem with large vocabulary

    100,000 words is large and slows training; many words may be rare and noisy.
  2. Step 2: Choose best vocabulary control method

    Setting max_features to a smaller number like 5000 keeps common words and speeds training.
  3. Final Answer:

    Set max_features to a smaller number like 5000 in your vectorizer -> Option D
  4. Quick Check:

    Limit vocabulary size to speed training [OK]
Hint: Limit vocabulary size to speed training and reduce noise [OK]
Common Mistakes:
  • Using all words, which slows training
  • Only removing stop words without size control
  • Increasing max_features unnecessarily
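How much text a 5,000-word cap would still cover depends on the corpus. As a rough back-of-the-envelope sketch, the code below assumes word frequency is proportional to 1/rank (a Zipf-like distribution); that assumption and the zipf_coverage helper are illustrative only, so a real dataset should be measured directly.

```python
def zipf_coverage(vocab_size, kept):
    """Token coverage of the top `kept` words if frequency is proportional to 1/rank."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    return sum(weights[:kept]) / sum(weights)

coverage = zipf_coverage(100_000, 5_000)
print(f"{coverage:.2f}")
```

Under this assumption the top 5% of words account for roughly three quarters of all tokens, which is why a much smaller vocabulary can retain most of the signal.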