Model Pipeline - LDA with scikit-learn
This pipeline uses Latent Dirichlet Allocation (LDA) to find topics in a collection of text documents. It first converts raw text into a document-term matrix of word counts, then fits the LDA model on that matrix to discover hidden themes.
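The two stages described above can also be chained into a single scikit-learn `Pipeline`. This is a minimal sketch, not necessarily how the original lesson wires things together: the toy corpus and the name `topic_pipeline` are assumptions chosen for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration only.
docs = ["apple banana apple", "banana orange banana", "apple orange orange"]

# Step 1 turns text into word counts; step 2 fits LDA on those counts.
topic_pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)

# One row per document, one column per topic.
doc_topics = topic_pipeline.fit_transform(docs)
print(doc_topics.shape)
```

Wrapping both steps in one object means new documents can later be passed straight to `topic_pipeline.transform(...)` as raw strings, without repeating the vectorizing step by hand.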
```text
1200.5 |************
1100.3 |**********
1050.7 |********
1025.4 |*******
1010.2 |******
```
| Epoch | Loss ↓ | Accuracy ↑ | Observation |
|---|---|---|---|
| 1 | 1200.5 | N/A | Initial model fit, high loss as topics are random |
| 2 | 1100.3 | N/A | Loss decreases as topics start to form |
| 3 | 1050.7 | N/A | Model converging, topics clearer |
| 4 | 1025.4 | N/A | Loss stabilizes, good topic separation |
| 5 | 1010.2 | N/A | Final epoch, model ready for prediction |
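The table does not say how its loss values were computed. One way to get a comparable per-epoch number from scikit-learn is to refit with an increasing `max_iter` and record `perplexity` on the training matrix. The sketch below assumes that interpretation of "loss" and uses a made-up toy corpus, so its numbers will not match the table.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
docs = ["apple banana apple", "banana orange banana", "apple orange orange"]
dtm = CountVectorizer().fit_transform(docs)

perplexities = []
for n_iter in range(1, 6):
    # Refit from the same seed, allowing one more pass over the data each time.
    lda = LatentDirichletAllocation(n_components=2, max_iter=n_iter, random_state=0)
    lda.fit(dtm)
    # Perplexity on the training data; lower generally means a better fit.
    perplexities.append(lda.perplexity(dtm))

for n_iter, value in enumerate(perplexities, start=1):
    print(f"{n_iter} | {value:.1f}")
```

Accuracy stays N/A in this setting because LDA is unsupervised: there are no labels to score predictions against.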
topic_distribution?
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana orange banana", "apple orange orange"]

# Build the document-term matrix of word counts.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# Fit a 2-topic LDA model; random_state makes the result reproducible.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Per-document topic proportions.
topic_distribution = lda.transform(dtm)
```
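To see what `topic_distribution` holds, it helps to check its shape and row sums. The sketch below repeats the snippet's setup so that it runs on its own. The `dominant_topic` variable is an addition for illustration and is not part of the original example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana orange banana", "apple orange orange"]
dtm = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
topic_distribution = lda.transform(dtm)

# Rows are documents, columns are topics: 3 documents x 2 topics.
print(topic_distribution.shape)

# Each row is a probability distribution over topics, so it sums to about 1.
print(topic_distribution.sum(axis=1))

# Index of the highest-weighted topic for each document (illustrative helper).
dominant_topic = topic_distribution.argmax(axis=1)
print(dominant_topic)
```

The number of rows follows the number of documents and the number of columns follows `n_components`, so changing either one changes the shape accordingly.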
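```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat dog", "dog mouse", "cat mouse"]

# Build the document-term matrix of word counts.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# No random_state is set, so the printed values vary between runs.
lda = LatentDirichletAllocation(n_components=2)
# fit() is enough here because only the fitted components_ are used.
lda.fit(dtm)

# Topic-word weights: one row per topic, one column per vocabulary word.
print(lda.components_)
```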