In Latent Dirichlet Allocation, what does a 'topic' most accurately represent?
Think about how LDA models topics as probabilities over vocabulary.
In LDA, each topic is a probability distribution over words, indicating which words are likely to co-occur in that topic.
Given the following Python code using sklearn's LDA, what is the shape of doc_topic_dist?
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana orange banana", "apple orange orange"]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)
doc_topic_dist = lda.transform(dtm)
print(doc_topic_dist.shape)
Check how many documents and topics are in the model.
The transform method returns the topic distribution for each document. There are 3 documents and 2 topics, so the shape is (3, 2).
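Each row of the returned array is itself a probability distribution over the topics, so the rows sum to 1. A quick sanity check, reusing the question's toy corpus:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana orange banana", "apple orange orange"]
dtm = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

doc_topic_dist = lda.transform(dtm)
print(doc_topic_dist.shape)        # (3, 2): one row per document, one column per topic
print(doc_topic_dist.sum(axis=1))  # each row sums to 1
```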
You want to model topics in a large collection of news articles using LDA. Which approach is best to decide the number of topics?
Think about how to balance model complexity and interpretability.
Choosing the number of topics is often done by testing multiple values and evaluating metrics like coherence or perplexity, combined with domain knowledge.
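One common version of this search can be sketched with sklearn's built-in `perplexity` on a held-out split. The corpus and the candidate topic counts below are illustrative assumptions; on real news articles you would use a much larger corpus and typically also a coherence metric (e.g. from gensim) plus manual inspection:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = ["apple banana apple", "banana orange banana", "apple orange orange",
        "apple apple banana", "orange banana orange", "apple orange apple"]
dtm = CountVectorizer().fit_transform(docs)
train, test = train_test_split(dtm, test_size=0.33, random_state=0)

# Fit one model per candidate topic count and score each on held-out data.
scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    scores[k] = lda.perplexity(test)  # lower is better

best_k = min(scores, key=scores.get)
print(scores)
print("best k by held-out perplexity:", best_k)
```

Perplexity alone can keep improving with more topics, which is why it is usually combined with coherence and a human look at the top words per topic.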
After training an LDA model, you get a perplexity score of 1200 on your test set. What does a lower perplexity score indicate?
Perplexity measures how well the model predicts unseen data.
A lower perplexity means the model better predicts the test data, showing better generalization.
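In sklearn this is a one-line call on a fitted model; `perplexity` is computed from the variational bound (roughly the exponential of the negative per-word bound), so a better fit to the data drives it down. A minimal sketch on the earlier toy corpus (in practice you would pass held-out documents, not the training matrix):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana orange banana", "apple orange orange"]
dtm = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
# Perplexity on the training data, for illustration only.
print(lda.perplexity(dtm))
```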
Will this code raise an error? If not, what does it print?
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat dog", "dog mouse", "cat mouse"]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit_transform(dtm)
print(lda.components_.shape)
Check the shape of the document-term matrix and the number of topics.
The document-term matrix has 3 features (the unique words cat, dog, and mouse), and the model has 3 topics, so components_ has shape (n_topics, n_features) = (3, 3). No error occurs; the code prints (3, 3).