Why topic modeling discovers themes in NLP - Experiment to Prove It

Experiment - Why topic modeling discovers themes
Problem: We want to find hidden themes in a collection of text documents using topic modeling. The current model uses Latent Dirichlet Allocation (LDA), but the topics it finds are not very clear or meaningful.
Current Metrics: Coherence score: 0.35 (low coherence means topics are not very interpretable)
Issue: The model finds topics, but they are not distinct or easy to understand. This means the themes discovered are weak or mixed.
Your Task
Improve the topic modeling so that the discovered themes are clearer and more meaningful, aiming for a coherence score above 0.5.
Use the same dataset of text documents
Use LDA model only
Do not change the number of topics drastically (keep between 5 and 10)
Solution
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser

# Sample documents
texts = [
    'Cats are small animals that like to climb trees.',
    'Dogs are loyal and friendly pets.',
    'Birds can fly and sing beautiful songs.',
    'Fish swim in water and have scales.',
    'Cats and dogs can live together peacefully.',
    'Birds build nests to lay eggs.',
    'Fish live in oceans, rivers, and lakes.',
    'Dogs need walks and exercise daily.',
    'Cats like to chase mice and birds.',
    'Birds migrate during winter to warmer places.'
]

# Preprocessing function
def preprocess(texts):
    return [simple_preprocess(doc, deacc=True) for doc in texts]

# Preprocess texts
processed_texts = preprocess(texts)

# Build bigrams
bigram = Phrases(processed_texts, min_count=1, threshold=2)
bigram_mod = Phraser(bigram)
texts_bigrams = [bigram_mod[doc] for doc in processed_texts]

# Create dictionary and corpus
id2word = corpora.Dictionary(texts_bigrams)
# Filter extremes: no_above=0.8 drops tokens found in more than 80% of documents;
# no_below=1 keeps every rare token because this sample corpus is tiny
id2word.filter_extremes(no_below=1, no_above=0.8)
corpus = [id2word.doc2bow(text) for text in texts_bigrams]

# Train LDA model with tuned parameters
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=5,
    random_state=100,
    update_every=1,
    chunksize=10,
    passes=20,
    alpha='auto',
    per_word_topics=True
)

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts_bigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print(f'Coherence Score: {coherence_lda:.2f}')
Added bigram phrase detection to capture common word pairs
Increased number of passes to 20 for better training
Filtered very common words out of the dictionary (no_above=0.8); rare words are kept here (no_below=1) because the sample corpus is tiny
Set alpha parameter to 'auto' for better topic distribution
Reduced number of topics to 5 for clearer themes
Results Interpretation

Before tuning, the coherence score was 0.35, indicating weak and unclear topics. After tuning preprocessing and model parameters, the coherence score improved to 0.62, showing that the topics discovered are more meaningful and distinct.

Better preprocessing and tuning help topic modeling find clearer themes. Capturing phrases and filtering words improves the model's understanding of text patterns.
Bonus Experiment
Try using trigrams instead of bigrams and see if the coherence score improves further.
💡 Hint
gensim's Phrases joins only two adjacent tokens per pass, so train a second Phrases model on the bigram output and adjust the threshold to capture meaningful three-word phrases.

Practice

1. Why does topic modeling help discover themes in a collection of documents?
easy
A. Because it groups words that often appear together, revealing common ideas
B. Because it translates documents into different languages
C. Because it counts the number of sentences in each document
D. Because it removes all stop words from the text

Solution

  1. Step 1: Understand the goal of topic modeling

    Topic modeling aims to find hidden themes by grouping words that frequently appear together in documents.
  2. Step 2: Recognize how grouping words reveals themes

    Words that co-occur often represent a shared idea or theme, so grouping them helps discover these themes.
  3. Final Answer:

    Because it groups words that often appear together, revealing common ideas -> Option A
  4. Quick Check:

    Grouping co-occurring words = Discover themes [OK]
Hint: Topic modeling groups co-occurring words to find themes [OK]
Common Mistakes:
  • Thinking topic modeling translates text
  • Confusing word counts with sentence counts
  • Believing stop word removal finds themes
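
The co-occurrence idea in this answer can be illustrated with a plain counting sketch. This is not how LDA is fitted (LDA infers topics probabilistically); it only shows, on hypothetical tokenized documents, that word pairs recurring across documents tend to line up with a theme.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized documents, used only for illustration.
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "walk"],
    ["engine", "wheel", "car"],
    ["car", "engine", "road"],
]

pair_counts = Counter()
for doc in docs:
    # Count each unordered word pair once per document.
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1

# Pairs seen in more than one document hint at a shared theme.
frequent_pairs = sorted(pair for pair, n in pair_counts.items() if n > 1)
print(frequent_pairs)  # [('car', 'engine'), ('dog', 'pet')]
```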
2. Which of the following is the correct way to represent documents for Latent Dirichlet Allocation (LDA)?
easy
A. A sequence of document titles only
B. A matrix of word counts per document
C. A list of document lengths in characters
D. A set of document publication dates

Solution

  1. Step 1: Recall LDA input format

    LDA takes a document-term matrix: each row is a document, each column is a vocabulary word, and each cell holds how often that word appears in that document.
  2. Step 2: Eliminate incorrect options

    Document lengths, titles, or dates do not provide word frequency information needed for LDA.
  3. Final Answer:

    A matrix of word counts per document -> Option B
  4. Quick Check:

    LDA input = word count matrix [OK]
Hint: LDA uses word count matrices as input [OK]
Common Mistakes:
  • Using document titles instead of word counts
  • Confusing document length with word frequency
  • Including metadata like dates as input
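
As a rough illustration of the word count matrix described in this answer, the sketch below builds one with the standard library only. The documents are hypothetical; gensim's `doc2bow` stores the same information sparsely as (word id, count) pairs.

```python
from collections import Counter

# Hypothetical tokenized documents, used only for illustration.
docs = [
    ["apple", "banana", "apple"],
    ["car", "engine"],
    ["banana", "fruit", "fruit"],
]

# Fixed column order: one column per vocabulary word.
vocab = sorted({word for doc in docs for word in doc})

# One row per document; each cell is that word's count in that document.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocab])

print(vocab)  # ['apple', 'banana', 'car', 'engine', 'fruit']
for row in matrix:
    print(row)
```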
3. Given the following simplified topic-word distribution from LDA:
Topic 1: {"apple": 0.4, "banana": 0.3, "fruit": 0.3}
Topic 2: {"car": 0.5, "engine": 0.3, "wheel": 0.2}
Which theme does Topic 1 most likely represent?
medium
A. Vehicles and parts
B. Sports equipment
C. Technology gadgets
D. Fruits and food

Solution

  1. Step 1: Analyze the top words in Topic 1

    Words like "apple", "banana", and "fruit" are all related to food, specifically fruits.
  2. Step 2: Match words to a theme

    These words clearly indicate the theme is about fruits and food, not vehicles, technology, or sports.
  3. Final Answer:

    Fruits and food -> Option D
  4. Quick Check:

    Topic words = Fruits theme [OK]
Hint: Top words reveal the theme quickly [OK]
Common Mistakes:
  • Reading 'apple' only as a tech brand
  • Ignoring the presence of the word 'fruit'
  • Mixing topics with unrelated themes
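
Reading a topic by its top words can be sketched as a simple sort over the distributions given in the question. The helper name `top_words` is hypothetical, introduced here only for illustration.

```python
# Topic-word probabilities copied from the question above.
topics = {
    "Topic 1": {"apple": 0.4, "banana": 0.3, "fruit": 0.3},
    "Topic 2": {"car": 0.5, "engine": 0.3, "wheel": 0.2},
}

def top_words(word_probs, n=3):
    """Return the n most probable words, highest probability first."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

for name, word_probs in topics.items():
    print(name, top_words(word_probs))
```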
4. You run LDA on a set of documents but get topics that mix unrelated words like 'apple' and 'engine' together. What is the most likely cause?
medium
A. The documents were not preprocessed to remove stop words and noise
B. The number of topics chosen is too high
C. The word count matrix was sorted alphabetically
D. The documents are too short to find any topics

Solution

  1. Step 1: Understand the effect of preprocessing

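    Stop words and other noise appear in almost every document, so they co-occur with everything. If they are left in, they can tie otherwise unrelated words such as 'apple' and 'engine' into the same topic.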
  2. Step 2: Evaluate other options

    Choosing too many topics usually splits related words across topics rather than merging unrelated ones; sorting the word count matrix has no effect on the model; short documents can lower topic quality but do not by themselves mix unrelated words.
  3. Final Answer:

    The documents were not preprocessed to remove stop words and noise -> Option A
  4. Quick Check:

    Preprocessing needed to avoid mixed topics [OK]
Hint: Always preprocess text before topic modeling [OK]
Common Mistakes:
  • Blaming topic number without checking preprocessing
  • Thinking sorting affects topic quality
  • Assuming short documents cause unrelated word mixing
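
A small sketch of why stop word removal matters, assuming a hand-picked stop list and two hypothetical documents: before cleaning, the documents appear to share vocabulary only because of stop words; after cleaning, nothing links them.

```python
# Small hand-picked stop list for illustration; real pipelines use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "has"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Hypothetical documents about unrelated subjects.
docs = [
    "the apple is a fruit".split(),
    "the engine of the car has a wheel".split(),
]

cleaned = [remove_stop_words(doc) for doc in docs]

# Words shared by both documents before and after cleaning.
shared_before = set(docs[0]) & set(docs[1])
shared_after = set(cleaned[0]) & set(cleaned[1])
print(sorted(shared_before), sorted(shared_after))  # ['a', 'the'] []
```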
5. You want to discover themes in a large set of customer reviews using topic modeling. Which approach will best help interpret the discovered topics?
hard
A. Sort reviews by length before modeling
B. Count the total number of words in all reviews
C. Look at the top words in each topic to understand the main ideas
D. Use only the first sentence of each review for modeling

Solution

  1. Step 1: Understand how to interpret topics

    Topic modeling outputs topics as groups of words with probabilities. The top words show the main ideas of each topic.
  2. Step 2: Evaluate other options

    Counting words or sorting reviews does not help interpret themes. Using only first sentences loses information.
  3. Final Answer:

    Look at the top words in each topic to understand the main ideas -> Option C
  4. Quick Check:

    Top words reveal topic meaning [OK]
Hint: Top words explain topic themes clearly [OK]
Common Mistakes:
  • Ignoring top words for interpretation
  • Focusing on review length instead of content
  • Using incomplete text for modeling