Unsupervised text segmentation using LDA and MCMC

Authors:
Kaimin Yu; Zhe Li; Genliang Guan; Zhiyong Wang; David Feng
Affiliations:
University of Sydney, NSW, Australia (all authors)
Venue:
AusDM '12 Proceedings of the Tenth Australasian Data Mining Conference - Volume 134
Year:
2012

Abstract

In this paper, we propose a data-driven approach to text segmentation, whereas most existing unsupervised methods determine segmentation boundaries by empirically exploring similarity measures between adjacent units (e.g., sentences). First, we train a latent Dirichlet allocation (LDA) model on the large-scale Wikipedia corpus to avoid the problem of vocabulary mismatch, which makes our approach domain-independent. Second, each segment unit is represented as a distribution over topics instead of a set of word tokens. Finally, a text input is modeled as a sequence of segment units, and a Markov chain Monte Carlo (MCMC) technique is employed to decide the appropriate boundaries. The major advantage of using MCMC is its ability to detect both strong and weak boundaries. Experimental results demonstrate that our proposed approach achieves promising results on a widely used benchmark dataset when compared with state-of-the-art methods.
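
The abstract does not give the details of the sampler, so the following is only a minimal sketch of the general idea: units already represented as topic distributions, with boundaries chosen by a Metropolis-style MCMC search. Everything specific here is an assumption made for illustration, not the authors' method: the function names (`segment_energy`, `mcmc_segment`), the energy (within-segment squared spread plus a per-boundary penalty), the single-boundary flip proposal, and the `penalty` and `temperature` values. The topic vectors are hand-made stand-ins for the output of an LDA model trained elsewhere.

```python
import math
import random


def segment_energy(units, boundaries, penalty):
    """Assumed energy (lower is better): spread of topic vectors around
    each segment's mean, plus a fixed penalty per boundary."""
    cuts = [0] + sorted(boundaries) + [len(units)]
    energy = penalty * len(boundaries)
    for start, end in zip(cuts, cuts[1:]):
        segment = units[start:end]
        dims = len(segment[0])
        mean = [sum(u[k] for u in segment) / len(segment) for k in range(dims)]
        energy += sum((u[k] - mean[k]) ** 2 for u in segment for k in range(dims))
    return energy


def mcmc_segment(units, penalty=0.1, temperature=0.05, steps=5000, seed=0):
    """Metropolis search over boundary sets. A boundary b means a new
    segment starts at unit index b (0-based). Returns the lowest-energy
    boundary set visited, as a sorted list."""
    rng = random.Random(seed)
    current = set()
    current_e = segment_energy(units, current, penalty)
    best, best_e = set(current), current_e
    for _ in range(steps):
        # Symmetric proposal: toggle one candidate boundary position.
        pos = rng.randrange(1, len(units))
        proposal = current ^ {pos}
        proposal_e = segment_energy(units, proposal, penalty)
        delta = proposal_e - current_e
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = proposal, proposal_e
            if current_e < best_e:
                best, best_e = set(current), current_e
    return sorted(best)


# Toy input: five units dominated by topic 0, then five dominated by topic 1.
demo_units = [[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 5
demo_boundaries = mcmc_segment(demo_units)
print(demo_boundaries)
```

In this sketch the `penalty` term controls how readily weak topic shifts are accepted as boundaries: lowering it admits more boundaries, raising it keeps only the strong ones. How the paper itself balances strong and weak boundaries is not stated in the abstract.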