Optimizing semantic coherence in topic models

  • Authors:
  • David Mimno, Princeton University, Princeton, NJ
  • Hanna M. Wallach, University of Massachusetts Amherst, Amherst, MA
  • Edmund Talley, National Institutes of Health, Bethesda, MD
  • Miriam Leenders, National Institutes of Health, Bethesda, MD
  • Andrew McCallum, University of Massachusetts Amherst, Amherst, MA

  • Venue:
  • EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2011

Abstract

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
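The evaluation metric described in the abstract scores a topic by how often its top words co-occur in documents from the training data itself, with no external reference corpus. A minimal sketch of such a co-document coherence score (function and variable names are mine, not from the paper's code):

```python
import math

def umass_coherence(top_words, docs):
    """Score a topic by log co-document frequencies of its top words.

    top_words: a topic's most probable words, ordered from most to
               least probable.
    docs: the training documents, each given as a set of word types.

    For each ordered pair (w_l, w_m) with w_l ranked above w_m, adds
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents
    containing the given word(s). The +1 smoothing keeps the log
    finite when a pair never co-occurs.
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            d_l = sum(1 for d in docs if w_l in d)
            d_ml = sum(1 for d in docs if w_l in d and w_m in d)
            score += math.log((d_ml + 1) / d_l)
    return score
```

A topic whose top words always appear together scores near zero, while one mixing unrelated words accumulates large negative terms, which is what lets flawed topics be flagged automatically.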