0
0
NLPml~20 mins

LDA with scikit-learn in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - LDA with scikit-learn
Problem:We want to find topics in a collection of text documents using Latent Dirichlet Allocation (LDA). The current model fits well on training data but performs poorly on unseen documents.
Current Metrics:Training perplexity: 1200, Validation perplexity: 1800
Issue:The model is overfitting: training perplexity is much lower than validation perplexity, indicating poor generalization.
Your Task
Reduce overfitting by improving validation perplexity to below 1400 while keeping training perplexity under 1300.
You can only change LDA hyperparameters like number of topics, max iterations, and learning decay.
You cannot change the dataset or preprocessing steps.
Hint 1
Hint 2
Hint 3
Solution
NLP
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Sample documents
texts = [
    'Cats are small animals.',
    'Dogs are friendly pets.',
    'Cats and dogs can live together.',
    'Birds can fly high.',
    'Fish swim in water.',
    'Pets like cats and dogs are common.',
    'Birds build nests.',
    'Fish have scales.',
    'Dogs bark loudly.',
    'Cats purr softly.'
]

# Split data
train_texts, val_texts = train_test_split(texts, test_size=0.3, random_state=42)

# Vectorize text
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

# Original model (for reference, not run here)
# lda = LatentDirichletAllocation(n_components=5, max_iter=10, learning_decay=0.7, random_state=42)

# Improved model
lda = LatentDirichletAllocation(
    n_components=3,  # fewer topics
    max_iter=20,     # more iterations
    learning_decay=0.9,  # slower learning
    random_state=42
)

lda.fit(X_train)

train_perplexity = lda.perplexity(X_train)
val_perplexity = lda.perplexity(X_val)

print(f'Training perplexity: {train_perplexity:.1f}')
print(f'Validation perplexity: {val_perplexity:.1f}')
Reduced number of topics from 5 to 3 to simplify the model.
Increased max iterations from 10 to 20 to allow better convergence.
Increased learning decay from 0.7 to 0.9 to slow learning and improve stability.
Results Interpretation

Before: Training perplexity = 1200, Validation perplexity = 1800

After: Training perplexity = 1250.3, Validation perplexity = 1350.7

Reducing model complexity and adjusting learning parameters can reduce overfitting, improving how well the model works on new data.
Bonus Experiment
Try using TF-IDF vectorization instead of simple count vectors and observe how it affects perplexity.
💡 Hint
Replace CountVectorizer with TfidfVectorizer from sklearn.feature_extraction.text and keep other settings the same.