Statistical Models for Text Segmentation

  • Authors:
  • Doug Beeferman, Adam Berger, John Lafferty

  • Affiliation:
  • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

  • Venue:
  • Machine Learning - Special issue on natural language learning
  • Year:
  • 1999


Abstract

This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.