Sequential Latent Dirichlet Allocation: Discover Underlying Topic Structures within a Document

  • Authors:
  • Lan Du; Wray Lindsay Buntine; Huidong Jin

  • Venue:
  • ICDM '10: Proceedings of the 2010 IEEE International Conference on Data Mining
  • Year:
  • 2010

Abstract

Understanding how topics within a document evolve over its structure is an interesting and important problem. In this paper, we address this problem by presenting a novel variant of Latent Dirichlet Allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e., a document consists of multiple segments (e.g., chapters, paragraphs), each of which is correlated with its previous and subsequent segments. In our model, a document and its segments are modelled as random mixtures over the same set of latent topics, each of which is a distribution over words; the topic distribution of each segment depends on that of its previous segment, and that of the first segment depends on the document-level topic distribution. This progressive dependency is captured by the nested two-parameter Poisson-Dirichlet process (PDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the PDP. Our experimental results on patent documents show that, by taking the sequential structure within a document into account, our SeqLDA model achieves higher fidelity than LDA in terms of perplexity (a standard measure of dictionary-based compressibility). SeqLDA also yields a more coherent sequential topic structure than LDA, as we show in experiments on books such as Melville's "The Whale".
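
The abstract describes SeqLDA's generative story only in prose. The following is a minimal, illustrative Python sketch of that forward process, assuming each segment's topic distribution is a single PDP (Pitman-Yor) draw whose base measure is the previous segment's distribution, realised here via truncated stick-breaking. The function names, the discount a, the concentration b, and the truncation level are all assumptions made for illustration; the paper's actual inference is a collapsed Gibbs sampler over the nested PDP, not this forward simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pdp_sample(base_probs, a, b, truncation=2000):
    """One draw from a two-parameter Poisson-Dirichlet process
    PDP(a, b, H) with a finite discrete base measure H, using
    truncated stick-breaking (V_k ~ Beta(1 - a, b + k*a))."""
    ks = np.arange(1, truncation + 1)
    v = rng.beta(1.0 - a, b + ks * a)                       # stick proportions
    sticks = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = rng.choice(len(base_probs), size=truncation, p=base_probs)
    theta = np.zeros(len(base_probs))
    np.add.at(theta, atoms, sticks)                         # merge repeated atoms
    return theta / theta.sum()      # fold truncated leftover mass back in

def seqlda_generate(doc_theta, phi, seg_lengths, a=0.2, b=10.0):
    """Sketch of the SeqLDA generative chain: segment j draws its topic
    distribution from PDP(a, b, theta_{j-1}); the first segment uses the
    document-level distribution doc_theta as its base measure."""
    prev, segments = doc_theta, []
    for n_words in seg_lengths:
        theta = pdp_sample(prev, a, b)                      # segment topics
        z = rng.choice(len(theta), size=n_words, p=theta)   # topic per word
        segments.append([rng.choice(phi.shape[1], p=phi[t]) for t in z])
        prev = theta                                        # chain forward
    return segments

if __name__ == "__main__":
    K, V = 5, 50                                            # toy sizes
    phi = rng.dirichlet(np.full(V, 0.1), size=K)            # topic-word dists
    doc_theta = rng.dirichlet(np.full(K, 0.5))              # document topics
    segs = seqlda_generate(doc_theta, phi, seg_lengths=[100, 100, 100])
    print([len(s) for s in segs])
```

Because each segment's distribution is anchored to its predecessor rather than drawn independently, adjacent segments in the simulation share topic mass and drift gradually, which is the behaviour the abstract contrasts against plain LDA's exchangeable segments.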