Combining concept hierarchies and statistical topic models

Authors: Chaitanya Chemudugunta; Padhraic Smyth; Mark Steyvers
Affiliations: University of California, Irvine, Irvine, CA, USA (all authors)
Venue: Proceedings of the 17th ACM Conference on Information and Knowledge Management
Year: 2008


Abstract

Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set, the learned topics are not always easy to interpret. Human-defined concepts, on the other hand, tend to be semantically richer because the words that define them are carefully selected, but they tend not to cover the themes in a data set exhaustively. In this paper, we propose a probabilistic framework that combines a hierarchy of human-defined semantic concepts with statistical topic models to seek the best of both worlds. Experimental results using two different sources of concept hierarchies and two collections of text documents indicate that this combination leads to systematic improvements in the quality of the associated language models and enables new techniques for inferring and visualizing the semantics of a document.