
Choosing number of topics in NLP - ML Experiment: Train & Evaluate

Experiment - Choosing number of topics
Problem: You want to find the best number of topics for a topic model on a text dataset. Currently, you use 10 topics, but the model's coherence score is low, indicating poor topic quality.
Current Metrics: Number of topics: 10, Coherence score: 0.35
Issue: The coherence score is low, meaning the topics are not very meaningful. The number of topics may be too high or too low.
Your Task
Find the number of topics between 5 and 15 that gives the highest coherence score, improving topic quality.
Use the same dataset and preprocessing steps.
Only change the number of topics parameter.
Evaluate coherence score for each model.
Solution
import gensim
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# Sample preprocessed documents (list of token lists)
documents = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

# Create dictionary and corpus
id2word = corpora.Dictionary(documents)
corpus = [id2word.doc2bow(text) for text in documents]

best_num_topics = None
# Start below any possible score; measures such as 'u_mass' can fall below -1
best_coherence = float('-inf')
coherence_scores = {}

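for num_topics in range(5, 16):
    # Train one LDA model per candidate topic count; the fixed seed keeps runs comparable
    lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, random_state=42, passes=10)
    # Score the model's topics against the tokenized documents using the c_v measure
    coherence_model = CoherenceModel(model=lda_model, texts=documents, dictionary=id2word, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores[num_topics] = coherence_score
    # Remember the best-scoring topic count seen so far
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_num_topics = num_topics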

print(f"Best number of topics: {best_num_topics}")
print(f"Best coherence score: {best_coherence:.4f}")
print("Coherence scores for all tested topic numbers:")
for k, v in coherence_scores.items():
    print(f"{k}: {v:.4f}")
Tested multiple numbers of topics from 5 to 15 instead of fixed 10.
Calculated coherence score for each model to evaluate topic quality.
Selected the number of topics with the highest coherence score.
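The selection step can also be kept separate from training. The following is a minimal sketch, assuming the scores have already been collected in a dictionary shaped like coherence_scores. The helper pick_num_topics and its tolerance parameter are hypothetical and not part of gensim. The example scores are placeholders: only the values for 7 and 10 topics come from this experiment, and the rest are made up for illustration.

```python
def pick_num_topics(scores, tolerance=0.0):
    """Return the smallest topic count whose score is within `tolerance` of the best.

    `scores` maps a topic count to a coherence score where higher is better.
    This is a hypothetical helper for illustration, not a gensim function.
    """
    if not scores:
        raise ValueError("scores is empty")
    best = max(scores.values())
    # Every topic count that scores close enough to the best one
    candidates = [k for k, v in scores.items() if v >= best - tolerance]
    # Prefer the smallest model among near-ties
    return min(candidates)


# Placeholder scores: only 7 -> 0.48 and 10 -> 0.35 are quoted in this experiment.
example_scores = {5: 0.41, 6: 0.44, 7: 0.48, 8: 0.47, 9: 0.40, 10: 0.35}

print(pick_num_topics(example_scores))        # 7
print(pick_num_topics(example_scores, 0.05))  # 6
```

With tolerance=0.0 this matches the plain "highest score wins" rule used in the solution. A small positive tolerance is one possible design choice when neighbouring topic counts score almost the same and a smaller model is preferred.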
Results Interpretation

Before: Number of topics = 10, Coherence score = 0.35

After: Best number of topics = 7, Coherence score = 0.48

Changing the number of topics from 10 to 7 raised the coherence score from 0.35 to 0.48, so the topics are more meaningful. Comparing coherence scores across a range of topic counts is a practical way to choose the number.
Bonus Experiment
Try using a different coherence measure like 'u_mass' or 'c_uci' to see if the best number of topics changes.
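Hint: Change the coherence parameter in CoherenceModel and compare results. Different measures use different scales, so compare which number of topics ranks best rather than the raw values.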
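To get a feel for what a document-count measure such as 'u_mass' rewards, here is a rough pure-Python sketch. It is a simplified UMass-style score with add-one smoothing, written for illustration only. It is not gensim's implementation, and its values will not match CoherenceModel output. The function name umass_style_score and the two example topics are assumptions. The documents are the sample documents from the solution.

```python
import math
from itertools import combinations

# Sample documents from the solution (lists of tokens)
documents = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]


def umass_style_score(top_words, docs, smoothing=1.0):
    """Mean over word pairs of log((D(earlier, later) + smoothing) / D(earlier)).

    D(...) counts the documents containing all given words. `top_words` is
    assumed to be ordered from most to least probable within the topic.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    pair_scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_count(earlier)
        if denom == 0:
            continue  # skip words that never occur in the documents
        pair_scores.append(math.log((doc_count(earlier, later) + smoothing) / denom))
    return sum(pair_scores) / len(pair_scores) if pair_scores else float('nan')


# Hypothetical topics: one whose words co-occur, one whose words never do
related = ['graph', 'trees', 'minors']
unrelated = ['graph', 'human', 'time']

print(f"related:   {umass_style_score(related, documents):.4f}")
print(f"unrelated: {umass_style_score(unrelated, documents):.4f}")
```

Words that appear together in the same documents push the score up, and words that never co-occur pull it down. Scores of this kind are typically negative, with values closer to zero being better, which is why they cannot be compared directly with 'c_v' scores.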