A segmented topic model based on the two-parameter Poisson-Dirichlet process

Authors:
Lan Du; Wray Buntine; Huidong Jin
Affiliations:
Lan Du, Wray Buntine: Research School of Information Sciences and Engineering, The Australian National University, Canberra, Australia, and NICTA, Canberra, Australia
Huidong Jin: Research School of Information Sciences and Engineering, The Australian National University, Canberra, Australia, and CSIRO Mathematics, Informatics and Statistics, Canberra, Australia
Venue:
Machine Learning
Year:
2010

Abstract

Documents come naturally with structure: a section contains paragraphs, which themselves contain sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date with topic models, we use the marginalized posterior of a two-parameter Poisson-Dirichlet process (or Pitman-Yor process) to handle the hierarchical modelling. Experiments using either paragraphs or sentences as segments show that the method significantly outperforms both standard topic models, applied to either whole documents or individual segments, and previous segmented models, as measured by held-out perplexity.
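
For readers unfamiliar with the evaluation measure named in the abstract, the following is a minimal sketch of how held-out perplexity is commonly computed: the exponential of the negative average log-probability a model assigns to each held-out word token. The function name `held_out_perplexity` and its input format are assumptions made for illustration; the abstract does not describe the authors' exact evaluation procedure.

```python
import math


def held_out_perplexity(log_probs):
    """Return the perplexity of a held-out set.

    log_probs: natural-log predictive probabilities, one per held-out
    word token (a hypothetical input format, assumed for this sketch).
    """
    if not log_probs:
        raise ValueError("need at least one held-out token")
    # Perplexity = exp(-(1/N) * sum_i log p(w_i))
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 1/4 to every token has perplexity
# of about 4, i.e. it is as uncertain as a uniform choice among 4 words.
uniform = [math.log(1 / 4)] * 10
print(held_out_perplexity(uniform))
```

Lower values are better: a model that assigns higher probability to the held-out words obtains a lower perplexity.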