LDA with Gensim in NLP

Introduction
LDA (Latent Dirichlet Allocation) is a topic model that finds hidden topics in a collection of documents. Gensim lets you train one in a few lines of code.
Typical situations where it helps:
You want to discover the main themes in a collection of news articles.
You have customer reviews and want to see the common topics people talk about.
You want to organize a large set of emails by subject automatically.
You want to explore research papers to find popular research areas.
Syntax
from gensim import corpora, models

# Prepare data
texts = [['word1', 'word2'], ['word3', 'word4']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA model
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Get topics
topics = lda_model.print_topics()
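texts is a list of tokenized documents (each document is a list of word strings).
corpus is a list of bag-of-words vectors, one per document; each vector is a list of (word ID, count) pairs.
print_topics() returns a list of (topic ID, formatted string) pairs and writes them to the log; it does not print to standard output.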
Examples
Simple example with two documents and two topics.
from gensim import corpora, models

texts = [['apple', 'banana', 'apple'], ['banana', 'orange']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=5)
topics = lda_model.print_topics()  # list of (topic ID, formatted string) pairs
Example with three documents and three topics, using more passes over the corpus.
from gensim import corpora, models

texts = [['cat', 'dog'], ['dog', 'mouse'], ['cat', 'mouse', 'dog']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)
topics = lda_model.print_topics()  # list of (topic ID, formatted string) pairs
Sample Model
This program finds 2 topics from 9 small documents using LDA with Gensim.
from gensim import corpora, models

# Sample documents
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

# Create dictionary and corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA model
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

# Print topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")
Important Notes
More passes over the corpus usually help the model settle on more stable topics, but training takes longer.
The dictionary maps each unique word to an integer ID, which the model needs as input.
The corpus is a list of documents, each represented as a list of (word ID, count) pairs.
Summary
LDA with Gensim finds hidden topics in text collections.
You prepare data by tokenizing, creating a dictionary, and making a corpus.
Train the model with a chosen number of topics and passes, then inspect the topics.