Bayesian unsupervised topic segmentation

Authors: Jacob Eisenstein; Regina Barzilay
Affiliations: Massachusetts Institute of Technology, Cambridge, MA (both authors)
Venue: EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year: 2008

Abstract

This paper describes a novel Bayesian approach to unsupervised topic segmentation. Unsupervised systems for this task are driven by lexical cohesion: the tendency of well-formed segments to induce a compact and consistent lexical distribution. We show that lexical cohesion can be placed in a Bayesian context by modeling the words in each topic segment as draws from a multinomial language model associated with the segment; maximizing the observation likelihood in such a model yields a lexically cohesive segmentation. This contrasts with previous approaches, which relied on hand-crafted cohesion metrics. The Bayesian framework provides a principled way to incorporate additional features such as cue phrases, a powerful indicator of discourse structure that has not previously been used in unsupervised segmentation systems. Our model yields consistent improvements over an array of state-of-the-art systems on both text and speech datasets. We also show that both an entropy-based analysis and a well-known previous technique can be derived as special cases of the Bayesian framework.
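
The core idea in the abstract, scoring each candidate segment by the likelihood of its words under a segment-specific multinomial and choosing the segmentation that maximizes the total, can be illustrated with a small sketch. This is not the authors' implementation. It assumes details the abstract does not state: a symmetric Dirichlet prior with a hypothetical concentration `alpha` that is integrated out, a known number of segments, and an exhaustive dynamic program over sentence boundaries. Cue phrases are not modeled, and the names `segment_log_likelihood` and `segment` are illustrative only.

```python
import math
from collections import Counter


def segment_log_likelihood(sentences, alpha, vocab_size):
    """Log marginal likelihood of one segment's word sequence.

    Assumption (not from the abstract): the segment's multinomial language
    model has a symmetric Dirichlet(alpha) prior that is integrated out.
    """
    counts = Counter(word for sent in sentences for word in sent)
    total = sum(counts.values())
    ll = math.lgamma(vocab_size * alpha) - math.lgamma(total + vocab_size * alpha)
    # Words with zero count contribute lgamma(alpha) - lgamma(alpha) = 0,
    # so only observed words need to be visited.
    for c in counts.values():
        ll += math.lgamma(c + alpha) - math.lgamma(alpha)
    return ll


def segment(sentences, num_segments, alpha=0.1):
    """Return the start indices of segments 2..K that maximize total likelihood."""
    vocab_size = len({word for sent in sentences for word in sent})
    n = len(sentences)
    neg_inf = float("-inf")
    # best[k][t]: best score for splitting the first t sentences into k segments.
    best = [[neg_inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):
                if best[k - 1][s] == neg_inf:
                    continue
                score = best[k - 1][s] + segment_log_likelihood(
                    sentences[s:t], alpha, vocab_size
                )
                if score > best[k][t]:
                    best[k][t] = score
                    back[k][t] = s
    # Walk the back-pointers from the end of the document to recover boundaries.
    starts = []
    t = n
    for k in range(num_segments, 0, -1):
        t = back[k][t]
        starts.append(t)
    starts.reverse()
    return starts[1:]  # drop the trivial start at sentence 0


if __name__ == "__main__":
    doc = [
        ["cat", "dog", "cat"],
        ["dog", "cat", "pet"],
        ["stock", "bond", "market"],
        ["market", "stock", "bond"],
    ]
    print(segment(doc, num_segments=2))
```

On this toy document the highest-scoring two-way split places the boundary before the third sentence, because grouping sentences that share vocabulary concentrates each segment's counts on a few words, which is the lexical cohesion effect the abstract describes.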