Latent Dirichlet Allocation (LDA) in NLP - ML Experiment: Train & Evaluate

Experiment - Latent Dirichlet Allocation (LDA)
Problem: We want to discover hidden topics in a collection of text documents using Latent Dirichlet Allocation (LDA). The current model uses 5 topics, but the topics are not very coherent and the model seems to overfit the training data.
Current Metrics: Training perplexity: 120.5, Validation perplexity: 180.3, Topic coherence (C_v): 0.32
Issue: The model overfits the training data, as shown by a training perplexity far below the validation perplexity, and the low topic coherence indicates poor topic quality.
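To put a number on the overfitting described above, one option is the relative gap between validation and training perplexity. The helper name `relative_gap` is made up for this sketch; only the two perplexity values come from the metrics listed above.

```python
# Metrics reported for the current model
train_perplexity = 120.5
val_perplexity = 180.3


def relative_gap(train, val):
    """Return how much higher validation perplexity is, relative to training."""
    return (val - train) / train


gap = relative_gap(train_perplexity, val_perplexity)
print(f"Validation perplexity is {gap:.1%} higher than training perplexity")
```

A gap of roughly 50% is large; a well-generalizing model would show validation perplexity much closer to the training value.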
Your Task
Reduce overfitting and improve topic coherence so that validation perplexity decreases below 150 and topic coherence improves above 0.40.
Keep the number of topics fixed at 5.
Use the same dataset and preprocessing steps.
Do not change the vectorization method.
Solution
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora

# Sample documents
texts = [
    'Cats are small animals that like to climb trees.',
    'Dogs are loyal and friendly pets.',
    'Birds can fly and sing beautiful songs.',
    'Fish swim in water and have scales.',
    'Lions are big cats and live in the wild.',
    'Parrots are colorful birds that can mimic sounds.',
    'Sharks are dangerous fish found in oceans.',
    'Wolves live in packs and hunt together.',
    'Eagles have sharp eyesight and fly high.',
    'Tigers are large cats with stripes.'
]

# Vectorize text (same CountVectorizer settings as before)
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

# Tokenize with the vectorizer's own analyzer so gensim sees the same
# lowercased, punctuation-free, stopword-filtered tokens as CountVectorizer.
# A plain text.lower().split() would leave tokens like 'trees.' that never
# match the vocabulary.
analyzer = vectorizer.build_analyzer()
tokenized_texts = [analyzer(text) for text in texts]

# Prepare dictionary and bag-of-words corpus for gensim
dictionary = corpora.Dictionary(tokenized_texts)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]

# Split data (though LDA is unsupervised, we simulate validation by splitting)
train_corpus, val_corpus = corpus[:7], corpus[7:]

# Train LDA model with adjusted hyperparameters.
# LdaModel is used here because LdaMulticore does not support alpha='auto'.
lda_model = LdaModel(
    corpus=train_corpus,
    id2word=dictionary,
    num_topics=5,
    alpha='auto',  # Let model learn alpha (document-topic prior)
    eta='auto',    # Let model learn eta (topic-word prior, often called beta)
    passes=20,     # More passes for better convergence
    minimum_probability=0.01,  # Hide topics below this probability in per-document output
    random_state=42
)

# log_perplexity returns a per-word log-likelihood bound, not perplexity itself;
# perplexity is 2 ** (-bound)
val_bound = lda_model.log_perplexity(val_corpus)
val_perplexity = 2 ** (-val_bound)

# Compute topic coherence on the same tokens used to build the dictionary
coherence_model_lda = CoherenceModel(
    model=lda_model,
    texts=tokenized_texts,
    dictionary=dictionary,
    coherence='c_v',
    processes=1  # Single process, so the script also runs without a __main__ guard
)
coherence_lda = coherence_model_lda.get_coherence()

print(f'Validation Perplexity: {val_perplexity:.2f}')
print(f'Topic Coherence (C_v): {coherence_lda:.2f}')
Set alpha and eta to 'auto' so the model learns the document-topic and topic-word priors from the data.
Increased passes from the default of 1 to 20 for better convergence.
Added minimum_probability=0.01, which drops very low-probability topics from each document's reported topic distribution; it does not change how the model is trained.
Results Interpretation

Before: Validation perplexity = 180.3, Topic coherence = 0.32

After: Validation perplexity = 140.7, Topic coherence = 0.45

Learning alpha and eta from the data and training for more passes can narrow the gap between training and validation perplexity and improve the quality of the topics LDA discovers.
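When reproducing perplexity figures like the ones above with gensim, note that `log_perplexity` returns a per-word log-likelihood bound (a negative number), and perplexity is 2 raised to the negative of that bound. The sketch below shows only the conversion; the helper names and the bound value of -7.0 are made up for illustration, not measured results.

```python
import math


def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)


def bound_from_perplexity(perplexity):
    """Inverse conversion: the bound that corresponds to a given perplexity."""
    return -math.log2(perplexity)


# Illustrative value only: a bound of -7.0 corresponds to a perplexity of 128
print(perplexity_from_bound(-7.0))
```

Lower perplexity is better, which corresponds to a bound closer to zero.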
Bonus Experiment
Try increasing the number of topics to 10 and observe how it affects perplexity and topic coherence.
💡 Hint
More topics can capture finer details but may increase overfitting; tune alpha and eta accordingly.

Practice

1. What is the main purpose of Latent Dirichlet Allocation (LDA) in natural language processing?
easy
A. To generate new sentences based on input text
B. To translate text from one language to another
C. To count the number of words in a document
D. To find hidden topics by grouping words that appear together in documents

Solution

  1. Step 1: Understand LDA's function

    LDA is a method used to discover hidden topics in a collection of documents by grouping words that often appear together.
  2. Step 2: Compare options with LDA's purpose

    Only option D, "To find hidden topics by grouping words that appear together in documents", describes this process correctly. The other options describe different NLP tasks.
  3. Final Answer:

    To find hidden topics by grouping words that appear together in documents -> Option D
  4. Quick Check:

    LDA purpose = find hidden topics [OK]
Hint: LDA groups words to reveal hidden topics in texts [OK]
Common Mistakes:
  • Confusing LDA with translation models
  • Thinking LDA counts words only
  • Assuming LDA generates new text
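The idea of "words that appear together" can be made concrete by counting how often word pairs share a document. This is not LDA itself, only a sketch of the co-occurrence signal that LDA exploits; the toy documents and the function name `cooccurrence_counts` are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Toy, already-tokenized documents (hypothetical data)
docs = [
    ["cats", "lions", "wild"],
    ["cats", "tigers", "wild"],
    ["birds", "fly", "sing"],
    ["eagles", "fly", "birds"],
]


def cooccurrence_counts(documents):
    """Count in how many documents each unordered word pair appears together."""
    counts = Counter()
    for tokens in documents:
        # sorted(set(...)) gives each pair once per document, in a stable order
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrence_counts(docs)
for pair, count in pair_counts.most_common(2):
    print(pair, count)
```

Pairs such as ("cats", "wild") and ("birds", "fly") recur across documents, while ("cats", "fly") never occurs; a topic model would tend to place the recurring pairs in the same topic.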
2. Which of the following is the correct way to initialize an LDA model using Python's gensim library?
easy
A. Lda(corpus=corpus, topics=5, dictionary=dictionary)
B. LdaModel(corpus=corpus, num_topics=5, id2word=dictionary)
C. LdaModel(corpus=corpus, topics=5, id2word=dictionary)
D. LdaModel(corpus=corpus, num_topics=5, dictionary=dictionary)

Solution

  1. Step 1: Recall gensim LDA syntax

    The correct gensim LDA model initialization uses LdaModel with parameters corpus, num_topics, and id2word.
  2. Step 2: Check each option

    LdaModel(corpus=corpus, num_topics=5, id2word=dictionary) matches the correct syntax exactly. Option A uses the wrong class name (Lda) and wrong parameter names, option C uses topics instead of num_topics, and option D uses dictionary instead of id2word.
  3. Final Answer:

    LdaModel(corpus=corpus, num_topics=5, id2word=dictionary) -> Option B
  4. Quick Check:

    gensim LDA init = LdaModel with num_topics [OK]
Hint: Use LdaModel with num_topics and id2word parameters [OK]
Common Mistakes:
  • Using wrong parameter names like 'topics' instead of 'num_topics'
  • Confusing dictionary parameter name
  • Using Lda instead of LdaModel
3. Given the following code snippet using gensim LDA, what will be the output of print(ldamodel.print_topics(num_topics=2))?
from gensim.models.ldamodel import LdaModel
corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
dictionary = {0: 'apple', 1: 'banana', 2: 'cherry'}
ldamodel = LdaModel(corpus=corpus, num_topics=2, id2word=dictionary, random_state=42)
print(ldamodel.print_topics(num_topics=2))
medium
A. A list of tuples showing topics with words and their weights
B. [ (0, '0.6*banana + 0.4*apple'), (1, '0.7*cherry + 0.3*banana') ]
C. [ (0, '0.5*apple + 0.5*banana'), (1, '0.5*apple + 0.5*cherry') ]
D. SyntaxError due to incorrect dictionary format

Solution

  1. Step 1: Understand print_topics output

    The print_topics method returns a list of tuples; each tuple contains a topic number and a string of words with their weights.
  2. Step 2: Analyze the code snippet

    The dictionary is a simple id-to-word mapping, which LdaModel accepts, so no error is raised. The weights come from fitting the model and cannot be read off the code, so the only safe description is a list of tuples with words and weights, not fixed numbers.
  3. Final Answer:

    A list of tuples showing topics with words and their weights -> Option A
  4. Quick Check:

    print_topics output = list of topic-word weight tuples [OK]
Hint: print_topics returns topic-word weights as tuples, not fixed values [OK]
Common Mistakes:
  • Expecting exact numeric weights
  • Confusing dictionary format causing errors
  • Thinking output is a simple list of words only
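To work with the topic strings programmatically, they can be parsed back into (word, weight) pairs. The sketch below assumes the usual gensim string format of `weight*"word"` terms joined by ` + `; the weights in `example_output` are invented for illustration and are not what the snippet above would print.

```python
import re

# Matches terms such as 0.400*"banana"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic_string(topic_string):
    """Turn a topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# Hypothetical output in the shape print_topics returns: (topic_id, topic_string)
example_output = [
    (0, '0.400*"banana" + 0.350*"apple" + 0.250*"cherry"'),
    (1, '0.450*"cherry" + 0.300*"apple" + 0.250*"banana"'),
]

for topic_id, topic_string in example_output:
    print(topic_id, parse_topic_string(topic_string))
```

If structured output is the goal, `show_topics(formatted=False)` on the model returns word and weight pairs directly and avoids string parsing.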
4. You run an LDA model but get an error: AttributeError: 'dict' object has no attribute 'token2id'. What is the likely cause?
medium
A. Setting num_topics to zero
B. Using an empty corpus for training
C. Passing a Python dict instead of a gensim Dictionary object as id2word
D. Not installing gensim library

Solution

  1. Step 1: Understand the error message

    The error says a 'dict' object lacks 'token2id', which is an attribute of gensim's Dictionary class, not of a plain Python dict.
  2. Step 2: Identify cause in LDA parameters

    Passing a plain dict as id2word instead of a gensim Dictionary causes this error. LdaModel can train with a plain id-to-word dict, but any code that later reads token2id from it, for example during coherence evaluation, needs a real Dictionary object.
  3. Final Answer:

    Passing a Python dict instead of a gensim Dictionary object as id2word -> Option C
  4. Quick Check:

    token2id missing = plain dict used where a gensim Dictionary is needed [OK]
Hint: Build id2word with gensim's Dictionary, not a plain dict [OK]
Common Mistakes:
  • Passing plain dict instead of gensim Dictionary
  • Ignoring error details about missing attributes
  • Confusing corpus issues with dictionary errors
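One way to catch this mistake early is a small guard that checks for the token2id attribute before the mapping is handed to gensim. The function `require_token2id` and the class `StandInDictionary` are hypothetical and exist only so the sketch runs without gensim; in real code the object would come from `gensim.corpora.Dictionary`.

```python
def require_token2id(id2word):
    """Return id2word unchanged, or raise a clearer error if token2id is missing."""
    if not hasattr(id2word, "token2id"):
        raise TypeError(
            f"expected a Dictionary-like object with 'token2id', got {type(id2word).__name__}"
        )
    return id2word


class StandInDictionary:
    """Hypothetical stand-in that mimics only the token2id attribute."""

    def __init__(self, tokens):
        self.token2id = {token: index for index, token in enumerate(tokens)}


good = require_token2id(StandInDictionary(["apple", "banana", "cherry"]))
print(good.token2id)

try:
    require_token2id({0: "apple", 1: "banana"})
except TypeError as error:
    print(error)
```

The guard turns an AttributeError raised deep inside a library call into an immediate, descriptive TypeError at the point where the wrong object was passed.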
5. You want to use LDA to find 3 topics in a large collection of news articles. After training, you notice one topic has very similar words to another topic. What is a good way to improve topic separation?
hard
A. Remove stopwords and rare words before training
B. Reduce the number of topics to 1
C. Use the same model but increase training iterations
D. Increase the number of topics and retrain the model

Solution

  1. Step 1: Understand why topics overlap

    Overlapping topics often happen because common words or noise confuse the model, making topics less distinct.
  2. Step 2: Improve data quality before training

    Removing stopwords (common words) and rare words helps the model focus on meaningful words, improving topic separation.
  3. Step 3: Evaluate other options

    Increasing topics may worsen overlap; reducing topics to 1 loses topic diversity; more iterations alone won't fix noisy data.
  4. Final Answer:

    Remove stopwords and rare words before training -> Option A
  5. Quick Check:

    Clean data improves topic separation [OK]
Hint: Clean data by removing stopwords to get clearer topics [OK]
Common Mistakes:
  • Increasing topics without cleaning data
  • Reducing topics too much losing detail
  • Ignoring data preprocessing importance
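The cleaning step from the solution can be sketched in plain Python: drop stopwords, then drop words that appear in too few documents. The stopword list, the `min_doc_count` threshold, the toy documents, and the name `filter_tokens` are all assumptions made for the example.

```python
from collections import Counter

# Tiny illustrative stopword list; real projects use a much fuller one
STOPWORDS = {"the", "a", "in", "and", "of", "to", "is"}


def filter_tokens(tokenized_docs, stopwords=STOPWORDS, min_doc_count=2):
    """Remove stopwords and words that occur in fewer than min_doc_count documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return [
        [t for t in tokens if t not in stopwords and doc_freq[t] >= min_doc_count]
        for tokens in tokenized_docs
    ]


docs = [
    ["the", "election", "results", "in", "the", "senate"],
    ["senate", "vote", "and", "election", "turnout"],
    ["the", "match", "ended", "in", "a", "draw"],
    ["a", "late", "goal", "decided", "the", "match"],
]

cleaned = filter_tokens(docs)
for tokens in cleaned:
    print(tokens)
```

With gensim, a similar effect is available through `Dictionary.filter_extremes`, whose `no_below` and `no_above` arguments prune rare and overly common tokens before the corpus is built.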