
LDA with Gensim in NLP - ML Experiment: Train & Evaluate

Experiment - LDA with Gensim
Problem: We want to find topics in a collection of text documents using LDA (Latent Dirichlet Allocation) with Gensim.
Current Metrics: Coherence score: 0.35
Issue: The model's coherence score is low, which indicates that the topics are not clearly separated or meaningful.
Your Task
Improve the coherence score to above 0.45 by tuning the number of topics and the number of passes, without overfitting.
Use Gensim's LdaModel only.
Do not change the preprocessing steps.
Keep the number of topics between 2 and 10.
Solution
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Sample documents
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

# Create dictionary and corpus
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

# Train LDA model with tuned parameters
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=4,
                     random_state=100,
                     update_every=1,
                     chunksize=10,
                     passes=20,
                     alpha='auto',
                     per_word_topics=True)

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print(f"Coherence Score: {coherence_lda:.4f}")
Increased the number of passes from the default (1) to 20 so the model makes more full sweeps over the corpus.
Set the number of topics to 4 to better capture distinct themes.
Used alpha='auto' so the model learns an asymmetric document-topic prior from the data instead of using a fixed one.
Results Interpretation

Before tuning: Coherence score = 0.35
After tuning: Coherence score = 0.52

Increasing passes and adjusting the number of topics helps the LDA model find clearer, more meaningful topics, improving coherence.
Bonus Experiment
Try using a different coherence measure like 'u_mass' and compare results.
💡 Hint
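Change the coherence parameter in CoherenceModel to 'u_mass' and observe how scores differ. 'u_mass' scores are typically negative (closer to 0 is better), so they are not directly comparable with 'c_v' values.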