A phrase-discovering topic model using hierarchical Pitman-Yor processes

  • Authors:
  • Robert V. Lindsey; William P. Headden, III; Michael J. Stipicevic

  • Affiliations:
  • University of Colorado, Boulder; Two Cassowaries Inc.; Google Inc.

  • Venue:
  • EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
  • Year:
  • 2012

Abstract

Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.
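As an illustration of the Pitman-Yor machinery the abstract refers to, below is a minimal sketch (not the authors' code) of sampling from a single Pitman-Yor process via its Chinese-restaurant construction. The parameter names `discount`, `concentration`, and `base_draw` are illustrative; the paper's model arranges many such processes into a hierarchy over phrase contexts rather than using one process in isolation.

```python
import random

def pitman_yor_sample(n, discount=0.5, concentration=1.0,
                      base_draw=lambda: random.randrange(10**6)):
    """Draw n observations from a Pitman-Yor process PY(discount, concentration, base)."""
    tables = []   # each entry is [dish, customer_count]
    seats = 0     # total customers seated so far
    draws = []
    for _ in range(n):
        # An existing table k is chosen with weight (count_k - discount);
        # a new table is opened with weight (concentration + discount * #tables).
        weights = [count - discount for _, count in tables]
        weights.append(concentration + discount * len(tables))
        r = random.random() * (seats + concentration)  # weights sum to seats + concentration
        acc, k = 0.0, 0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        if k == len(tables):            # open a new table with a dish drawn from the base
            tables.append([base_draw(), 1])
        else:                           # join existing table k
            tables[k][1] += 1
        seats += 1
        draws.append(tables[k][0])
    return draws
```

With a positive discount, the number of distinct dishes grows as a power law in the number of draws, which is the property that makes Pitman-Yor hierarchies a natural fit for modeling word and phrase frequencies.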