
Latent Dirichlet Allocation (LDA) in NLP

Introduction
Latent Dirichlet Allocation (LDA) is an unsupervised technique that finds hidden topics in a collection of texts. It groups words that often appear together to infer what the texts are about. Common uses:
You want to discover main themes in a large set of news articles.
You need to organize customer reviews by topics without reading all of them.
You want to summarize research papers by their main subjects.
You want to recommend articles based on topics they cover.
You want to explore common topics in social media posts.
Syntax
Python
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=number_of_topics, random_state=seed)
lda.fit(document_term_matrix)
n_components sets how many topics to find.
random_state fixes the random seed so results are reproducible.
document_term_matrix is a documents-by-vocabulary matrix of word counts, typically built with CountVectorizer.
Examples
Finds 3 topics in the data matrix X with a fixed random seed for reproducibility.
Python
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(X)
Finds 5 topics without setting a random seed.
Python
lda = LatentDirichletAllocation(n_components=5)
lda.fit(X)
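After fitting, you can also ask LDA how strongly each document belongs to each topic. A sketch using fit_transform (the three-line toy corpus is an assumption for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus
texts = ['dogs bark loudly', 'cats purr softly', 'dogs and cats are pets']
X = CountVectorizer(stop_words='english').fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X)  # rows = documents, columns = topic proportions

# Each row sums to 1: a document is modeled as a mixture of topics
print(doc_topics.shape)
```

The row for each document gives its topic proportions, which is useful for tasks like recommending articles with similar topic mixes.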
Sample Model
This program finds 2 topics in 5 short texts. It prints the top 3 words for each topic to show what the topic is about.
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample documents
texts = [
    'I love reading about machine learning and AI.',
    'AI and machine learning are fascinating fields.',
    'The cat sat on the mat.',
    'Cats and dogs are common pets.',
    'I enjoy walking my dog in the park.'
]

# Convert texts to a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

# Create LDA model to find 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show top words for each topic
n_top_words = 3
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")
Important Notes
LDA assumes each document is a mix of topics, and each topic is a mix of words.
Choosing the right number of topics (n_components) is important and may need trial and error.
Stop words (common words like 'the', 'and') should be removed to get better topics.
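One hedged way to guide that trial and error is to compare perplexity across candidate topic counts; lower perplexity generally indicates a better fit on the evaluated data. A sketch (the four-document corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus for comparing topic counts
texts = [
    'machine learning models learn from data',
    'deep learning is a branch of machine learning',
    'my cat chased the neighbor dog',
    'dogs and cats make friendly pets',
]
X = CountVectorizer(stop_words='english').fit_transform(texts)

scores = {}
for k in (2, 3):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    scores[k] = lda.perplexity(X)  # lower is better on the data being scored

print(scores)
```

In practice you would score held-out documents rather than the training data, and combine the numbers with a manual check that the top words per topic make sense.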
Summary
LDA finds hidden topics by grouping words that appear together in documents.
It helps organize and understand large collections of text without reading everything.
You set how many topics to find and get words that describe each topic.