In this paper, we propose a data-driven approach to text segmentation, whereas most existing unsupervised methods determine segment boundaries by empirically exploring similarity measures between adjacent units (e.g., sentences). First, we train a latent Dirichlet allocation (LDA) model on the large-scale Wikipedia corpus to avoid the problem of vocabulary mismatch, which makes our approach domain-independent. Second, each segment unit is represented as a distribution over topics instead of a set of word tokens. Finally, an input text is modeled as a sequence of segment units, and a Markov chain Monte Carlo (MCMC) technique is employed to decide the appropriate boundaries. The major advantage of using MCMC is its ability to detect both strong and weak boundaries. Experimental results demonstrate that the proposed approach achieves promising results on a widely used benchmark dataset when compared with state-of-the-art methods.
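
The last two steps (units represented as topic distributions, boundaries chosen by MCMC) can be illustrated with a minimal sketch. It is not the paper's actual model: the abstract does not specify the scoring function, the proposal move, or any parameter values, so everything below is an assumption made for illustration. The function names `segment_score` and `mcmc_segment`, the squared-distance cohesion objective, the per-boundary `penalty`, the `temperature`, and the toggle-one-boundary Metropolis proposal are all hypothetical. The topic vectors are hand-written toy values standing in for the output of an LDA model.

```python
import math
import random


def segment_score(units, boundaries, penalty=0.5):
    """Hypothetical objective (an assumption, not taken from the paper).

    Rewards segments whose units have similar topic distributions and
    charges a fixed penalty for every boundary. A boundary p means
    "cut before unit p", with 1 <= p <= len(units) - 1.
    """
    n_topics = len(units[0])
    score = 0.0
    start = 0
    for end in sorted(boundaries) + [len(units)]:
        segment = units[start:end]
        # Mean topic distribution of the segment.
        mean = [sum(u[t] for u in segment) / len(segment) for t in range(n_topics)]
        # Subtract the squared distance of every unit to the segment mean.
        score -= sum((u[t] - mean[t]) ** 2 for u in segment for t in range(n_topics))
        start = end
    return score - penalty * len(boundaries)


def mcmc_segment(units, n_iter=2000, temperature=0.1, penalty=0.5, seed=0):
    """Metropolis sampler over boundary sets; returns the best set visited."""
    rng = random.Random(seed)
    current = set()  # start with no boundaries
    current_score = segment_score(units, current, penalty)
    best, best_score = set(current), current_score
    for _ in range(n_iter):
        # Proposal: toggle one candidate boundary position (symmetric move).
        position = rng.randrange(1, len(units))
        proposal = current ^ {position}
        proposal_score = segment_score(units, proposal, penalty)
        delta = proposal_score - current_score
        # Metropolis rule: always accept improvements, sometimes accept worse states.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score > best_score:
                best, best_score = set(current), current_score
    return sorted(best)


# Toy input: four units dominated by topic 0, then four dominated by topic 1.
# In the approach described above, these vectors would come from an LDA model.
units = [[0.9, 0.1]] * 4 + [[0.1, 0.9]] * 4
print(mcmc_segment(units))
```

Under this toy objective, a single cut between the fourth and fifth units scores higher than any other boundary set, so the sampler is expected to settle there. Counting how often each position appears as a boundary across samples, rather than keeping only the best state, would be one possible way to separate stronger from weaker boundaries, though the abstract does not say how the authors do this.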