
LDA with scikit-learn in NLP

Introduction
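LDA (Latent Dirichlet Allocation) finds hidden topics in a collection of texts. It treats each document as a mixture of topics and each topic as a group of words that often appear together, which reveals the main themes. It is useful when: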
You want to discover topics in a set of news articles.
You need to organize customer reviews by themes without reading all of them.
You want to summarize large text data by main ideas.
You want to explore themes in social media posts.
You want to reduce text complexity for easier analysis.
Syntax
NLP
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=number_of_topics, random_state=seed)
lda.fit(document_term_matrix)
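n_components sets how many topics you want to find.
random_state fixes the random seed so that repeated runs give the same topics.
document_term_matrix is a matrix where each row is a document, each column is a word in the vocabulary, and each cell counts how often that word appears in that document.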
Examples
Create an LDA model to find 3 topics and fit it to data X.
NLP
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(X)
Find 5 topics, making at most 10 passes over the data. max_iter=10 is also scikit-learn's default, so raise it if the topics have not settled.
NLP
lda = LatentDirichletAllocation(n_components=5, max_iter=10, random_state=0)
lda.fit(X)
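Once a model like the ones above is fitted, the learned topics are stored in lda.components_. The sketch below shows its shape and one way to turn its rows into word probabilities. The four sentences are made up to stand in for X, which the examples above assume already exists.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up corpus standing in for X (illustration only)
docs = [
    "stocks rise as markets rally",
    "team wins the final match",
    "markets fall as stocks slide",
    "coach praises team after match",
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(X)

# One row per topic, one column per vocabulary word
print(lda.components_.shape)

# Rows of components_ are unnormalized weights; dividing each row by its
# sum gives a probability distribution over words for each topic
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word.sum(axis=1))  # each entry is 1 up to rounding
```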
Sample Model
This program finds 2 topics in 5 short texts. It prints the top 3 words for each topic to show what the topic is about.
NLP
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample documents
texts = [
    'I love reading books about science and technology',
    'The new movie was exciting and full of action',
    'Technology advances help science progress',
    'Action movies are thrilling and fun to watch',
    'Books on science explain complex ideas clearly'
]

# Convert texts to a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

# Create LDA model to find 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show top words for each topic
n_top_words = 3
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")
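The program above shows which words define each topic. A natural follow-up is to ask which topic each text belongs to. The sketch below repeats the same setup so it runs on its own, then uses lda.transform to get the topic mixture of every document. The names doc_topics and dominant are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [
    'I love reading books about science and technology',
    'The new movie was exciting and full of action',
    'Technology advances help science progress',
    'Action movies are thrilling and fun to watch',
    'Books on science explain complex ideas clearly'
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# transform returns one row per document; each row is that document's
# topic mixture and sums to 1
doc_topics = lda.transform(X)

for text, mix in zip(texts, doc_topics):
    dominant = mix.argmax() + 1  # 1-based, to match the "Topic 1/2" labels
    print(f"Topic {dominant} ({mix.max():.2f}): {text}")
```

With so few short texts the assignment may not split cleanly into "science" and "movies", so treat the printed mixtures as a demonstration of the API rather than a reliable result.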
Important Notes
LDA works best on a large collection of reasonably long documents. With only a handful of short texts, the topics can be unstable.
Removing common words (stop words) helps LDA find better topics.
You can tune n_components to get more or fewer topics.
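One rough way to compare values of n_components is to fit a model for each candidate and look at its perplexity, where lower is better. The sketch below is an illustration under assumptions: the candidate values (2, 3, 4) are arbitrary, and it scores the same data it trained on, which favors larger models. With a real corpus, it is better to score held-out documents and also read the top words per topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [
    'I love reading books about science and technology',
    'The new movie was exciting and full of action',
    'Technology advances help science progress',
    'Action movies are thrilling and fun to watch',
    'Books on science explain complex ideas clearly'
]
X = CountVectorizer(stop_words='english').fit_transform(texts)

# Fit one model per candidate topic count and record its perplexity
perplexities = {}
for k in (2, 3, 4):  # candidate values, chosen arbitrarily
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    perplexities[k] = lda.perplexity(X)
    print(f"n_components={k}: perplexity={perplexities[k]:.1f}")
```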
Summary
LDA finds hidden topics by grouping words that appear together.
Use CountVectorizer to turn text into word counts, the input LDA expects.
Check top words per topic to understand what each topic means.