In topic modeling, selecting the number of topics affects the model's usefulness. Why is this choice important?
Think about how topics represent themes in the data.
Choosing the number of topics balances two failure modes: too few topics merges distinct themes into one, while too many splits a coherent theme into fragments. Either extreme hurts interpretability and usefulness.
When training topic models, which metric is commonly used to evaluate and choose the best number of topics?
Think about a metric that measures prediction quality on new data.
Perplexity measures how well the model predicts held-out data; sweeping over topic counts and picking the one with the lowest perplexity helps find a number of topics that generalizes well.
Given the code below that computes perplexity for different numbers of topics, what is the output?
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["apple banana fruit", "banana orange fruit",
         "car truck vehicle", "truck bus vehicle"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

perplexities = {}
for n_topics in [2, 3]:
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    perplexities[n_topics] = lda.perplexity(X)
print(perplexities)
The code prints a dictionary mapping each topic count to its perplexity on the training data; lower perplexity means better model fit.
With 3 topics, the model can separate the fruit and vehicle themes more finely, so its perplexity is typically lower than with 2 topics.
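Note that the snippet above scores perplexity on the same documents the model was trained on, while perplexity is really meant to measure prediction of new data. A minimal sketch of a fairer comparison using a held-out split (the tiny corpus and the particular split here are made up for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus split into training and held-out documents
train_texts = ["apple banana fruit", "banana orange fruit",
               "car truck vehicle", "truck bus vehicle"]
heldout_texts = ["apple orange fruit", "bus car vehicle"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
# Reuse the training vocabulary for the held-out documents
X_heldout = vectorizer.transform(heldout_texts)

heldout_perp = {}
for n_topics in [2, 3]:
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X_train)
    # Perplexity on unseen documents reflects generalization,
    # not just how tightly the model fits the training set
    heldout_perp[n_topics] = lda.perplexity(X_heldout)
print(heldout_perp)
```

On a real corpus, a topic count whose training perplexity keeps dropping but whose held-out perplexity rises is a sign of overfitting.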
You want to select the number of topics for an LDA model using coherence score. Which approach is best?
Coherence measures how interpretable topics are.
Training multiple models with different numbers of topics and selecting the one with the highest coherence ensures the topics are meaningful and interpretable.
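One common coherence measure, UMass coherence, scores a topic by how often its top words co-occur in the same documents. A minimal sketch of the selection loop described above, implementing UMass coherence by hand on a toy corpus (sklearn does not ship a coherence score; libraries such as gensim provide ready-made ones):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["apple banana fruit", "banana orange fruit",
         "car truck vehicle", "truck bus vehicle"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
presence = X.toarray() > 0  # document-word presence matrix

def umass_coherence(top_word_ids, presence):
    # Sum log((D(w_i, w_j) + 1) / D(w_j)) over ordered top-word pairs,
    # where D counts documents containing the word(s)
    score = 0.0
    for i in range(1, len(top_word_ids)):
        for j in range(i):
            wi, wj = top_word_ids[i], top_word_ids[j]
            d_wj = presence[:, wj].sum()
            d_wi_wj = (presence[:, wi] & presence[:, wj]).sum()
            score += np.log((d_wi_wj + 1) / d_wj)
    return score

best_n, best_score = None, -np.inf
for n_topics in [2, 3, 4]:
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    # Average coherence over topics, using each topic's top 3 words
    topic_scores = [umass_coherence(comp.argsort()[::-1][:3], presence)
                    for comp in lda.components_]
    avg = np.mean(topic_scores)
    if avg > best_score:
        best_n, best_score = n_topics, avg
print(best_n, best_score)
```

The number of top words per topic (3 here) and the candidate topic counts are illustrative choices; on real data you would use more top words and a wider sweep.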
After increasing the number of topics in your LDA model beyond 10, you notice coherence scores drop and topics become less meaningful. What is the most likely cause?
Think about what happens when a model tries to create too many topics.
Too many topics cause the model to split meaningful themes into smaller, less coherent topics, reducing quality.