Unsupervised text segmentation using LDA and MCMC

  • Authors:
  • Kaimin Yu;Zhe Li;Genliang Guan;Zhiyong Wang;David Feng

  • Affiliations:
  • University of Sydney, NSW, Australia;University of Sydney, NSW, Australia;University of Sydney, NSW, Australia;University of Sydney, NSW, Australia;University of Sydney, NSW, Australia

  • Venue:
  • AusDM '12 Proceedings of the Tenth Australasian Data Mining Conference - Volume 134
  • Year:
  • 2012

Abstract

Most existing unsupervised methods determine segmentation boundaries by empirically exploring similarity measures between adjacent units (e.g., sentences). In this paper, we instead propose a data-driven approach to text segmentation. First, we train a latent Dirichlet allocation (LDA) model on the large-scale Wikipedia corpus to avoid the problem of vocabulary mismatch, which makes our approach domain-independent. Second, each segment unit is represented as a distribution over topics rather than as a set of word tokens. Finally, the input text is modeled as a sequence of segment units, and a Markov chain Monte Carlo (MCMC) technique is employed to decide the appropriate boundaries. The major advantage of using MCMC is its ability to detect both strong and weak boundaries. Experimental results demonstrate that the proposed approach achieves promising results on a widely used benchmark dataset when compared with state-of-the-art methods.
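
The abstract does not give the sampler or the scoring function, so the following is only a minimal sketch of how boundary sampling over topic distributions might look, not the authors' method. It assumes each unit already has a topic distribution (for example, inferred by a pre-trained LDA model), scores a segmentation by how well each segment's mean topic distribution explains its units, and charges a fixed cost per boundary. The names `segment_log_score`, `total_log_score`, `mcmc_boundaries`, and the `penalty` parameter are hypothetical. A plain Metropolis sampler flips one candidate boundary at a time; the fraction of samples in which a gap is a boundary gives one possible reading of "strong" versus "weak" boundaries.

```python
import math
import random


def segment_log_score(units, start, end):
    """Coherence of units[start:end]: sum over units of sum_t p(t) * log(mean(t)).

    This objective is an assumption made for illustration only.
    """
    n = end - start
    k = len(units[0])
    mean = [sum(units[i][t] for i in range(start, end)) / n for t in range(k)]
    return sum(
        units[i][t] * math.log(mean[t] + 1e-12)
        for i in range(start, end)
        for t in range(k)
    )


def total_log_score(units, boundaries, penalty):
    """boundaries[i] is True when a segment break follows unit i."""
    score, start = 0.0, 0
    for i, is_break in enumerate(boundaries):
        if is_break:
            score += segment_log_score(units, start, i + 1) - penalty
            start = i + 1
    return score + segment_log_score(units, start, len(units))


def mcmc_boundaries(units, penalty=1.0, n_iter=2000, burn_in=500, seed=0):
    """Metropolis sampling over boundary indicators.

    Returns, for each gap between adjacent units, the fraction of
    post-burn-in samples in which that gap was a boundary.
    """
    rng = random.Random(seed)
    n_gaps = len(units) - 1
    state = [False] * n_gaps
    current = total_log_score(units, state, penalty)
    counts = [0] * n_gaps
    for it in range(n_iter):
        j = rng.randrange(n_gaps)
        state[j] = not state[j]  # symmetric proposal: flip one gap
        proposed = total_log_score(units, state, penalty)
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed  # accept the flip
        else:
            state[j] = not state[j]  # reject: undo the flip
        if it >= burn_in:
            for g in range(n_gaps):
                counts[g] += state[g]
    kept = n_iter - burn_in
    return [c / kept for c in counts]


# Toy input: four units dominated by topic 0, then four dominated by topic 1.
toy_units = [[0.98, 0.01, 0.01]] * 4 + [[0.01, 0.98, 0.01]] * 4
toy_probs = mcmc_boundaries(toy_units, penalty=2.0, n_iter=5000, burn_in=1000)
print([round(p, 2) for p in toy_probs])
```

In this toy setting the gap between the fourth and fifth units should receive the highest boundary frequency. Thresholding the returned frequencies at different levels is one way a user could separate strong boundaries from weaker ones.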