Optimizing semantic coherence in topic models

  • Authors:
  • David Mimno, Princeton University, Princeton, NJ
  • Hanna M. Wallach, University of Massachusetts Amherst, Amherst, MA
  • Edmund Talley, National Institutes of Health, Bethesda, MD
  • Miriam Leenders, National Institutes of Health, Bethesda, MD
  • Andrew McCallum, University of Massachusetts Amherst, Amherst, MA

  • Venue:
  • EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2011

Abstract

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
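The evaluation metric described in the abstract scores a topic by how often its top words co-occur in documents from the training data itself, with no external reference corpus. A minimal sketch of such a co-document coherence score (function and variable names are mine, not from the paper's code):

```python
import math

def umass_coherence(top_words, docs):
    """Score a topic by log co-document frequencies of its top words.

    top_words: a topic's most probable words, ordered from most to
               least probable.
    docs: the training documents, each given as a set of word types.

    For each ordered pair (w_l, w_m) with w_l ranked above w_m, adds
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents
    containing the given word(s). The +1 smoothing keeps the log
    finite when a pair never co-occurs.
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            d_l = sum(1 for d in docs if w_l in d)
            d_ml = sum(1 for d in docs if w_l in d and w_m in d)
            score += math.log((d_ml + 1) / d_l)
    return score
```

A topic whose top words always appear together scores near zero, while one mixing unrelated words accumulates large negative terms, which is what lets flawed topics be flagged automatically.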