Choosing number of topics in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why is choosing the number of topics important in topic modeling?

In topic modeling, selecting the number of topics affects the model's usefulness. Why is this choice important?

A. Because too few topics may merge distinct themes, and too many topics may split themes unnecessarily.
B. Because the number of topics determines the size of the input data.
C. Because the number of topics controls the speed of the training algorithm only.
D. Because the number of topics decides the number of words in the vocabulary.
💡 Hint

Think about how topics represent themes in the data.

Metrics
intermediate
Which metric helps decide the optimal number of topics?

When training topic models, which metric is commonly used to evaluate and choose the best number of topics?

A. Mean Squared Error, which measures prediction error.
B. Accuracy, which measures correct topic labels.
C. F1 Score, which balances precision and recall.
D. Perplexity, which measures how well the model predicts unseen data.
💡 Hint

Think about a metric that measures prediction quality on new data.

Predict Output
advanced
Output of perplexity calculation for different topic numbers

Given the code below that computes perplexity for different numbers of topics, what is the output?

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["apple banana fruit", "banana orange fruit", "car truck vehicle", "truck bus vehicle"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

perplexities = {}
for n_topics in [2, 3]:
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    perplexities[n_topics] = lda.perplexity(X)

print(perplexities)
A. {2: 9.0, 3: 9.0}
B. {2: 9.0, 3: 7.5}
C. {2: 7.5, 3: 9.0}
D. {2: 7.5, 3: 7.5}
💡 Hint

Lower perplexity means better model fit.
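
To make the hint concrete, here is a minimal sketch of what perplexity measures: the exponential of the negative average log-probability per token. The per-token probabilities below are hypothetical, chosen only for illustration; they do not come from the LDA model in the question. scikit-learn's `perplexity` method reports an approximation of the same per-word quantity.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p(token))) over N tokens."""
    total_log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-total_log_prob / len(token_probs))

# Hypothetical probabilities that two models assign to the same six tokens
# (illustrative numbers, not produced by a real topic model).
uniform_model = [0.25] * 6  # guesses uniformly among four words
sharper_model = [0.5] * 6   # puts more probability on the words that occur

ppl_uniform = perplexity(uniform_model)
ppl_sharper = perplexity(sharper_model)

# The model that assigns higher probability to the observed tokens
# ends up with the lower perplexity.
print(ppl_uniform, ppl_sharper)
```

A model that guesses uniformly among four words has a perplexity of about 4, which is why perplexity is often read as the effective number of equally likely choices per token.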

Model Choice
advanced
Choosing number of topics using coherence score

You want to select the number of topics for an LDA model using coherence score. Which approach is best?

A. Select the topic count randomly and trust the model to adjust.
B. Pick the topic count with lowest perplexity without checking coherence.
C. Train models with different topic counts and pick the one with highest coherence score.
D. Choose the topic count based on the largest vocabulary size.
💡 Hint

Coherence measures how interpretable topics are.

🔧 Debug
expert
Why does increasing topics beyond a point worsen model quality?

After increasing the number of topics in your LDA model beyond 10, you notice coherence scores drop and topics become less meaningful. What is the most likely cause?

A. The model is overfitting by splitting coherent topics into smaller, less meaningful ones.
B. The model is underfitting because it has too few topics to capture data complexity.
C. The vocabulary size is too small to support more topics.
D. The training data is too large, causing the model to fail.
💡 Hint

Think about what happens when a model tries to create too many topics.

Practice

(1/5)
1. Why is it important to choose the right number of topics in topic modeling?
easy
A. To find clear and meaningful groups in the text data
B. To make the model run faster regardless of quality
C. To reduce the size of the text documents
D. To avoid using any stop words in the text

Solution

  1. Step 1: Understand the goal of topic modeling

    Topic modeling groups words that tend to occur together into topics and describes each document in terms of those topics, which reveals hidden themes.
  2. Step 2: Importance of topic number choice

    Choosing the right number of topics yields clear, meaningful groups rather than a few overly broad topics or many fragmented, confusing ones.
  3. Final Answer:

    To find clear and meaningful groups in the text data -> Option A
  4. Quick Check:

    Right topic number = clear groups [OK]
Hint: Right topic count = clear groups, not too few or many [OK]
Common Mistakes:
  • Thinking speed is the main reason to choose topic number
  • Believing topic number reduces document size
  • Confusing stop words removal with topic number choice
2. Which of the following is the correct way to set the number of topics in a typical LDA model using Python's gensim library?
easy
A. lda_model = LdaModel(corpus, n_topics=5, id2word=dictionary)
B. lda_model = LdaModel(corpus, topics=5, id2word=dictionary)
C. lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary)
D. lda_model = LdaModel(corpus, topic_number=5, id2word=dictionary)

Solution

  1. Step 1: Recall gensim LDA parameter names

    The correct parameter for setting the number of topics is num_topics.
  2. Step 2: Check each option

    Only lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary) uses num_topics=5; the other options use incorrect parameter names.
  3. Final Answer:

    lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary) -> Option C
  4. Quick Check:

    Parameter name for topics = num_topics [OK]
Hint: Use 'num_topics' parameter to set topic count in gensim LDA [OK]
Common Mistakes:
  • Using 'topics' or 'n_topics' instead of 'num_topics'
  • Confusing parameter names from other libraries
  • Omitting the id2word dictionary parameter
3. Given the following code snippet using sklearn's NMF for topic modeling, what will be the shape of the matrix W if n_components=4 and the input X has shape (100, 500)?
from sklearn.decomposition import NMF
model = NMF(n_components=4, random_state=42)
W = model.fit_transform(X)
medium
A. (4, 4)
B. (4, 500)
C. (100, 500)
D. (100, 4)

Solution

  1. Step 1: Understand NMF output matrices

    NMF factorizes X (samples x features) into W (samples x components) and H (components x features).
  2. Step 2: Apply shapes to given data

    X shape is (100, 500), n_components=4, so W shape is (100, 4).
  3. Final Answer:

    (100, 4) -> Option D
  4. Quick Check:

    W shape = samples x components = (100, 4) [OK]
Hint: W shape = number of samples by number of topics/components [OK]
Common Mistakes:
  • Confusing W with H matrix shape
  • Mixing up rows and columns in matrix shapes
  • Assuming output shape equals input shape
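
The shape reasoning above can be checked with a small sketch. The input matrix here is random, non-negative placeholder data standing in for a real document-term matrix, and `max_iter=500` is an assumption added only to give the solver more room to converge.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-term matrix: 100 documents, 500 terms.
rng = np.random.default_rng(0)
X = rng.random((100, 500))

model = NMF(n_components=4, random_state=42, max_iter=500)
W = model.fit_transform(X)  # document-topic weights: samples x components
H = model.components_       # topic-term weights: components x features

print(W.shape, H.shape)  # (100, 4) (4, 500)
```

Changing `n_components` changes only the second dimension of W and the first dimension of H; the number of documents and the vocabulary size stay fixed.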
4. You ran LDA with num_topics=10 but found that many topics have very similar top words. What is the likely issue, and how can you fix it?
medium
A. Too few topics chosen; increase num_topics to get more variety
B. Too many topics chosen; reduce num_topics to get clearer topics
C. Stop words were not removed; remove stop words to fix
D. The dictionary is too small; add more words to dictionary

Solution

  1. Step 1: Analyze the overlap between topics

    If many topics share similar top words, the topics are not distinct enough. This often means the model was asked for more topics than the data supports.
  2. Step 2: Adjust number of topics

    Reducing num_topics helps merge similar topics into clearer, distinct groups.
  3. Final Answer:

    Too many topics chosen; reduce num_topics to get clearer topics -> Option B
  4. Quick Check:

    Similar topics = too many topics [OK]
Hint: Similar topics? Try fewer topics for clarity [OK]
Common Mistakes:
  • Increasing topics when topics are already too similar
  • Blaming stop words without checking topic overlap
  • Adding words to dictionary without checking topic count
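
One way to check topic overlap before changing the topic count is to compare the top-word lists directly. The sketch below uses Jaccard similarity; the topic lists, the helper names, and the 0.5 threshold are all hypothetical choices for illustration, not part of gensim.

```python
from itertools import combinations

def jaccard(words_a, words_b):
    """Share of words the two lists have in common (0 = disjoint, 1 = identical)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

def overlapping_topic_pairs(topics, threshold=0.5):
    """Return (i, j, score) for every pair of topics at or above the threshold."""
    pairs = []
    for (i, topic_i), (j, topic_j) in combinations(enumerate(topics), 2):
        score = jaccard(topic_i, topic_j)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Hypothetical top words from a fitted model (illustrative, not real output).
topics = [
    ["game", "team", "season", "player"],
    ["team", "game", "player", "coach"],
    ["market", "stock", "price", "bank"],
]

flagged = overlapping_topic_pairs(topics)
print(flagged)
```

If many pairs are flagged, that supports trying a smaller num_topics; if none are, the topic count is less likely to be the problem.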
5. You have a large collection of news articles and want to find topics. You try 3, 5, 10, and 20 topics. The 3-topic model groups articles too broadly, and the 20-topic model creates many overlapping topics. How should you decide the best number of topics?
hard
A. Choose the number that balances clear, distinct topics without too much overlap, often between 5 and 10
B. Always pick the highest number of topics for more detail
C. Pick the lowest number of topics to keep it simple
D. Randomly select a number since topic modeling is unsupervised

Solution

  1. Step 1: Understand the trade-off in topic numbers

    Too few topics cause broad groups; too many cause overlap and confusion.
  2. Step 2: Choose a balanced number

    Testing multiple values and selecting one with clear, distinct topics (often between extremes) is best practice.
  3. Final Answer:

    Choose the number that balances clear, distinct topics without too much overlap, often between 5 and 10 -> Option A
  4. Quick Check:

    Balance topic count for clarity and detail [OK]
Hint: Balance topic count: not too few, not too many [OK]
Common Mistakes:
  • Always picking max topics without checking overlap
  • Choosing too few topics and ignoring how broad the resulting groups become
  • Ignoring evaluation of topic quality
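
Evaluating topic quality across candidate topic counts can be sketched with a simple UMass-style coherence score computed from document co-occurrence counts. This is a minimal pure-Python illustration under stated assumptions: the corpus is a toy example, the top-word lists for each candidate are hypothetical stand-ins for what fitted models would produce, and averaging over word pairs is a choice made here for readability.

```python
import math
from itertools import combinations

documents = [
    "apple banana fruit",
    "banana orange fruit",
    "car truck vehicle",
    "truck bus vehicle",
]
doc_sets = [set(doc.split()) for doc in documents]

def doc_freq(*words):
    """Number of documents that contain every word in `words`."""
    return sum(all(w in doc for w in words) for doc in doc_sets)

def umass_coherence(top_words):
    """Mean of log((D(w_m, w_l) + 1) / D(w_l)) over ranked word pairs.

    Assumes every top word occurs in at least one document.
    """
    scores = [
        math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
        for w_l, w_m in combinations(top_words, 2)  # w_l is ranked above w_m
    ]
    return sum(scores) / len(scores)

# Hypothetical top words per candidate topic count (not real model output).
candidates = {
    1: [["fruit", "vehicle", "banana", "truck"]],
    2: [["fruit", "banana", "apple"], ["vehicle", "truck", "car"]],
}

mean_coherence = {
    k: sum(umass_coherence(t) for t in topics) / len(topics)
    for k, topics in candidates.items()
}
best_k = max(mean_coherence, key=mean_coherence.get)
print(mean_coherence, best_k)
```

Higher (less negative) scores indicate top words that co-occur more often. A score like this is a guide rather than a verdict, so it is still worth reading the top words of the leading candidates before settling on a topic count.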