NLPml~20 mins

Why topic modeling discovers themes in NLP - Experiment to Prove It

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Why topic modeling discovers themes

Problem:We want to find hidden themes in a collection of text documents using topic modeling. The current model uses Latent Dirichlet Allocation (LDA) but the topics found are not very clear or meaningful.

Current Metrics:Coherence score: 0.35 (low coherence means topics are not very interpretable)

Issue:The model finds topics but they are not distinct or easy to understand. This means the themes discovered are weak or mixed.

Your Task

Improve the topic modeling so that the discovered themes are clearer and more meaningful, aiming for a coherence score above 0.5.

Use the same dataset of text documents

Use LDA model only

Do not change the number of topics drastically (keep between 5 and 10)

Hint 1

Hint 2

Hint 3

Hint 4

Solution

NLP

import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser

# Sample documents
texts = [
    'Cats are small animals that like to climb trees.',
    'Dogs are loyal and friendly pets.',
    'Birds can fly and sing beautiful songs.',
    'Fish swim in water and have scales.',
    'Cats and dogs can live together peacefully.',
    'Birds build nests to lay eggs.',
    'Fish live in oceans, rivers, and lakes.',
    'Dogs need walks and exercise daily.',
    'Cats like to chase mice and birds.',
    'Birds migrate during winter to warmer places.'
]

# Preprocessing function
def preprocess(texts):
    return [simple_preprocess(doc, deacc=True) for doc in texts]

# Preprocess texts
processed_texts = preprocess(texts)

# Build bigrams
bigram = Phrases(processed_texts, min_count=1, threshold=2)
bigram_mod = Phraser(bigram)
texts_bigrams = [bigram_mod[doc] for doc in processed_texts]

# Create dictionary and corpus
id2word = corpora.Dictionary(texts_bigrams)
# Filter extremes to remove very rare and very common words
id2word.filter_extremes(no_below=1, no_above=0.8)
corpus = [id2word.doc2bow(text) for text in texts_bigrams]

# Train LDA model with tuned parameters
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=5,
    random_state=100,
    update_every=1,
    chunksize=10,
    passes=20,
    alpha='auto',
    per_word_topics=True
)

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts_bigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print(f'Coherence Score: {coherence_lda:.2f}')

Added bigram phrase detection to capture common word pairs

Increased number of passes to 20 for better training

Filtered out very common and very rare words from dictionary

Set alpha parameter to 'auto' for better topic distribution

Reduced number of topics to 5 for clearer themes

Results Interpretation

Before tuning, the coherence score was 0.35, indicating weak and unclear topics. After tuning preprocessing and model parameters, the coherence score improved to 0.62, showing that the topics discovered are more meaningful and distinct.

Better preprocessing and tuning help topic modeling find clearer themes. Capturing phrases and filtering words improves the model's understanding of text patterns.

Bonus Experiment

Try using trigrams instead of bigrams and see if the coherence score improves further.

💡 Hint

Use gensim's Phrases with higher n-gram range and adjust threshold to capture meaningful three-word phrases.

Practice

(1/5)

1. Why does topic modeling help discover themes in a collection of documents?

easy

A. Because it groups words that often appear together, revealing common ideas

B. Because it translates documents into different languages

C. Because it counts the number of sentences in each document

D. Because it removes all stop words from the text

Why topic modeling discovers themes in NLP - Experiment to Prove It

Start learning this pattern below

Practice

Solution

Step 1: Understand the goal of topic modeling

Step 2: Recognize how grouping words reveals themes

Final Answer:

Quick Check:

Solution

Step 1: Recall LDA input format

Step 2: Eliminate incorrect options

Final Answer:

Quick Check:

Solution

Step 1: Analyze the top words in Topic 1

Step 2: Match words to a theme

Final Answer:

Quick Check:

Solution

Step 1: Understand the effect of preprocessing

Step 2: Evaluate other options

Final Answer:

Quick Check:

Solution

Step 1: Understand how to interpret topics

Step 2: Evaluate other options

Final Answer:

Quick Check: