Topic coherence evaluation helps us check if the topics found by a model make sense together. It tells us if the words in a topic are related and easy to understand.
Topic coherence evaluation in NLP
Introduction
Topic coherence evaluation is useful in these situations:
When you want to see whether your topic model groups words in a meaningful way.
When you are comparing different topic models to pick the best one.
When you are tuning the number of topics to find the most understandable set.
When you are explaining topics to others and want clear, coherent themes.
Syntax
    from gensim.models.coherencemodel import CoherenceModel

    coherence_model = CoherenceModel(
        model=your_topic_model,
        texts=tokenized_texts,
        dictionary=dictionary,
        coherence='c_v',
    )
    coherence_score = coherence_model.get_coherence()
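model is your trained topic model.
texts are your documents split into words (tokenized).
dictionary is the Gensim Dictionary that maps each word to an integer ID.
coherence names the measure to compute, such as 'c_v' or 'u_mass'.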
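The tokenized texts argument is simply a list of token lists, one per document. A minimal sketch of producing one with only the Python standard library follows. The `STOP_WORDS` set and the `tokenize` helper are illustrative assumptions, not part of Gensim, and a real project would likely use a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "be", "can", "the", "to"}


def tokenize(doc):
    """Lowercase a document, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


documents = [
    "Cats like to chase mice.",
    "Dogs and cats can be friends!",
]

# A list of token lists: the shape expected by the `texts` argument.
tokenized_docs = [tokenize(doc) for doc in documents]
print(tokenized_docs)
```

The resulting `tokenized_docs` can be passed directly as `texts` and to `corpora.Dictionary` when building the dictionary.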
Examples
Calculate coherence score using the 'c_v' measure for an LDA model.
    coherence_model = CoherenceModel(model=lda_model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
    score = coherence_model.get_coherence()

Calculate coherence score using 'u_mass' measure with just topic word lists (no model object).
    coherence_model = CoherenceModel(topics=topic_word_lists, texts=tokenized_docs, dictionary=dictionary, coherence='u_mass')
    score = coherence_model.get_coherence()

Sample Model
This code trains a simple topic model on a few sentences and calculates the coherence score to check how meaningful the topics are.
    import gensim
    from gensim import corpora
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    # Sample documents
    documents = [
        'cats like to chase mice',
        'dogs like to bark loudly',
        'cats and dogs can be friends',
        'mice are small and quick',
        'dogs bark and cats meow'
    ]

    # Tokenize documents
    tokenized_docs = [doc.lower().split() for doc in documents]

    # Create dictionary and corpus
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(text) for text in tokenized_docs]

    # Train LDA model with 2 topics
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)

    # Calculate coherence score
    coherence_model = CoherenceModel(model=lda_model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()

    print(f'Coherence Score: {coherence_score:.4f}')
Important Notes
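Higher coherence scores generally mean the top words in each topic are more closely related. Only compare scores computed with the same measure: 'u_mass' scores are typically negative (closer to zero is better), while 'c_v' scores typically fall between 0 and 1.
Gensim supports several coherence measures ('u_mass', 'c_v', 'c_uci', 'c_npmi'); 'c_v' is popular because it tends to agree well with human judgments of interpretability.
Tokenizing and cleaning your text well (for example, lowercasing and removing stop words) usually improves coherence results.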
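To make the notes on scores and measures more concrete, here is a simplified sketch of the idea behind a UMass-style score: for each pair of words in a topic, it asks how often the two words appear in the same document. This is not Gensim's implementation. The `umass_coherence` helper is hypothetical, the smoothing constant of 1.0 is an assumption, and every topic word is assumed to occur in at least one document, so the numbers will not match Gensim's output exactly.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs, epsilon=1.0):
    """Mean of log((D(w_i, w_j) + epsilon) / D(w_j)) over ordered word pairs.

    D(...) counts the documents that contain all of the given words.
    Assumes every topic word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    pair_scores = []
    # combinations() yields (earlier, later) pairs in topic order.
    for earlier, later in combinations(topic_words, 2):
        joint = doc_freq(earlier, later)
        pair_scores.append(math.log((joint + epsilon) / doc_freq(earlier)))
    return sum(pair_scores) / len(pair_scores)


documents = [
    'cats like to chase mice',
    'dogs like to bark loudly',
    'cats and dogs can be friends',
    'mice are small and quick',
    'dogs bark and cats meow',
]
tokenized_docs = [doc.lower().split() for doc in documents]

# Words that often share a document vs. words that never do.
related_score = umass_coherence(['cats', 'dogs', 'bark'], tokenized_docs)
unrelated_score = umass_coherence(['cats', 'quick', 'loudly'], tokenized_docs)

print(f'Related words:   {related_score:.4f}')
print(f'Unrelated words: {unrelated_score:.4f}')
```

Both scores are negative, and the topic whose words co-occur in the same documents lands closer to zero, which is why such scores are compared by "higher is better" within a single measure.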
Summary
Topic coherence helps measure how understandable topics are.
Use coherence scores to compare and improve topic models.
Gensim's CoherenceModel calculates coherence in just a few lines of code.