Introduction
Choosing the right number of topics helps us find clear, useful groups in text data. Too few topics lump unrelated themes together, while too many split one theme into several near-duplicate topics.
model = LatentDirichletAllocation(n_components=number_of_topics)
model.fit(data)
model = LatentDirichletAllocation(n_components=5)
model.fit(data)

model = LatentDirichletAllocation(n_components=10)
model.fit(data)

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
texts = [
    'I love reading about machine learning and AI.',
    'Deep learning is a part of machine learning.',
    'The economy is growing fast this year.',
    'Stock markets are unpredictable and volatile.',
    'AI can help improve healthcare and medicine.',
    'Investing in stocks requires knowledge of the market.'
]

# Convert texts to word counts
vectorizer = CountVectorizer(stop_words='english')
data = vectorizer.fit_transform(texts)

# Try different numbers of topics
for n_topics in [2, 3]:
    model = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    model.fit(data)
    print(f'Number of topics: {n_topics}')
    for idx, topic in enumerate(model.components_):
        top_words = [vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-3:][::-1]]
        print(f' Topic {idx+1}: {", ".join(top_words)}')
    print()
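Reading the top words by eye is one way to compare topic counts. A numeric check can sit alongside it. The sketch below fits one model per candidate count and records scikit-learn's `perplexity` for each. The candidate list `[2, 3, 4]` and the use of training-data perplexity are assumptions made for illustration. Lower perplexity is usually read as a better fit, but on a corpus this small the numbers only show the mechanics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Same six sample documents as in the example above
texts = [
    'I love reading about machine learning and AI.',
    'Deep learning is a part of machine learning.',
    'The economy is growing fast this year.',
    'Stock markets are unpredictable and volatile.',
    'AI can help improve healthcare and medicine.',
    'Investing in stocks requires knowledge of the market.'
]

data = CountVectorizer(stop_words='english').fit_transform(texts)

# Fit one model per candidate topic count and record its perplexity
# (computed here on the training data, which is an illustrative shortcut)
perplexities = {}
for n_topics in [2, 3, 4]:
    model = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    model.fit(data)
    perplexities[n_topics] = model.perplexity(data)

for n_topics, value in perplexities.items():
    print(f'{n_topics} topics: perplexity = {value:.1f}')
```

With a real corpus, scoring a held-out set of documents rather than the training data gives a fairer comparison, and the result is best read together with the top words of each topic.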
num_topics.
num_topics=5; the others use incorrect parameter names.
What is the shape of W if n_components=4 and the input X has shape (100, 500)?
from sklearn.decomposition import NMF

model = NMF(n_components=4, random_state=42)
W = model.fit_transform(X)
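To check the shapes directly, the snippet can be run on a stand-in matrix. Here `X` is random non-negative data with 100 rows (documents) and 500 columns (terms). This is an assumption for illustration, not real text data, and `max_iter=500` is added only to give the solver more room to converge.

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in document-term matrix: 100 documents, 500 terms (random, non-negative)
rng = np.random.default_rng(0)
X = rng.random((100, 500))

model = NMF(n_components=4, random_state=42, max_iter=500)
W = model.fit_transform(X)   # document-topic weights
H = model.components_        # topic-term weights

print(W.shape)  # (100, 4)
print(H.shape)  # (4, 500)
```

W has one row per document and one column per topic, so its shape is (number of documents, n_components). The number of terms shows up only in H.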
You set num_topics=10 but found that many topics have very similar top words. What is the likely issue, and how can you fix it?
Reducing num_topics helps merge similar topics into clearer, distinct groups.
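One way to spot near-duplicate topics before lowering the topic count is to measure how much their top words overlap. The helper below, `top_word_overlap`, is a hypothetical function written for this sketch and is not part of scikit-learn. It computes the Jaccard overlap of the top-k word indices for every pair of topics. The small weight matrix is made up so that the first two topics share the same top words.

```python
import numpy as np

def top_word_overlap(components, k=3):
    """Jaccard overlap of the top-k word indices for every pair of topics."""
    components = np.asarray(components)
    # Indices of the k largest weights in each topic row
    top = [set(np.argsort(row)[-k:]) for row in components]
    n = len(top)
    overlap = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            overlap[i, j] = len(top[i] & top[j]) / len(top[i] | top[j])
    return overlap

# Made-up topic-word weights: topics 0 and 1 share the same top three words
components = np.array([
    [9.0, 8.0, 7.0, 0.1, 0.2, 0.3],
    [8.5, 9.5, 7.5, 0.3, 0.1, 0.2],
    [0.2, 0.1, 0.3, 9.0, 8.0, 7.0],
])

overlap = top_word_overlap(components, k=3)
print(overlap)
```

For a fitted scikit-learn model, `model.components_` could be passed in place of the made-up matrix. Pairs with overlap close to 1 are candidates for merging, which suggests trying a smaller number of topics.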