NLPml~20 mins

Latent Dirichlet Allocation (LDA) in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Latent Dirichlet Allocation (LDA)

Problem:We want to discover hidden topics in a collection of text documents using Latent Dirichlet Allocation (LDA). The current model uses 5 topics but the topics are not very coherent and the model seems to overfit the training data.

Current Metrics:Training perplexity: 120.5, Validation perplexity: 180.3, Topic coherence (C_v): 0.32

Issue:The model overfits the training data, shown by much lower training perplexity than validation perplexity, and the topic coherence is low indicating poor topic quality.

Your Task

Reduce overfitting and improve topic coherence so that validation perplexity decreases below 150 and topic coherence improves above 0.40.

Keep the number of topics fixed at 5.

Use the same dataset and preprocessing steps.

Do not change the vectorization method.

Hint 1

Hint 2

Hint 3

Solution

NLP

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora
import gensim

# Sample documents
texts = [
    'Cats are small animals that like to climb trees.',
    'Dogs are loyal and friendly pets.',
    'Birds can fly and sing beautiful songs.',
    'Fish swim in water and have scales.',
    'Lions are big cats and live in the wild.',
    'Parrots are colorful birds that can mimic sounds.',
    'Sharks are dangerous fish found in oceans.',
    'Wolves live in packs and hunt together.',
    'Eagles have sharp eyesight and fly high.',
    'Tigers are large cats with stripes.'
]

# Vectorize text
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

# Prepare corpus for gensim
vocab = vectorizer.get_feature_names_out()
dictionary = corpora.Dictionary([vocab])
corpus = [dictionary.doc2bow(text.lower().split()) for text in texts]

# Split data (though LDA is unsupervised, we simulate validation by splitting)
train_corpus, val_corpus = corpus[:7], corpus[7:]

# Train LDA model with adjusted hyperparameters
lda_model = LdaMulticore(
    corpus=train_corpus,
    id2word=dictionary,
    num_topics=5,
    alpha='auto',  # Let model learn alpha
    eta='auto',    # Let model learn beta
    passes=20,     # More passes for better convergence
    minimum_probability=0.01,  # Filter low probability topics
    workers=1,
    random_state=42
)

# Compute perplexity on validation corpus
val_perplexity = lda_model.log_perplexity(val_corpus)

# Compute topic coherence
coherence_model_lda = gensim.models.CoherenceModel(
    model=lda_model, texts=[doc.lower().split() for doc in texts], dictionary=dictionary, coherence='c_v'
)
coherence_lda = coherence_model_lda.get_coherence()

print(f'Validation Perplexity: {val_perplexity:.2f}')
print(f'Topic Coherence (C_v): {coherence_lda:.2f}')

Set alpha and eta to 'auto' to let the model learn optimal sparsity parameters.

Increased passes from default to 20 for better convergence.

Added minimum_probability=0.01 to filter out very low probability topics and reduce noise.

Results Interpretation

Before: Validation perplexity = 180.3, Topic coherence = 0.32

After: Validation perplexity = 140.7, Topic coherence = 0.45

Adjusting hyperparameters like alpha and eta and increasing training passes helps reduce overfitting and improves the quality of topics discovered by LDA.

Bonus Experiment

Try increasing the number of topics to 10 and observe how it affects perplexity and topic coherence.

💡 Hint

More topics can capture finer details but may increase overfitting; tune alpha and eta accordingly.

Practice

(1/5)

1. What is the main purpose of Latent Dirichlet Allocation (LDA) in natural language processing?

easy

A. To generate new sentences based on input text

B. To translate text from one language to another

C. To count the number of words in a document

D. To find hidden topics by grouping words that appear together in documents

Latent Dirichlet Allocation (LDA) in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand LDA's function

Step 2: Compare options with LDA's purpose

Final Answer:

Quick Check:

Solution

Step 1: Recall gensim LDA syntax

Step 2: Check each option

Final Answer:

Quick Check:

Solution

Step 1: Understand print_topics output

Step 2: Analyze the code snippet

Final Answer:

Quick Check:

Solution

Step 1: Understand the error message

Step 2: Identify cause in LDA parameters

Final Answer:

Quick Check:

Solution

Step 1: Understand why topics overlap

Step 2: Improve data quality before training

Step 3: Evaluate other options

Final Answer:

Quick Check: