Statistical Models for Text Segmentation

  • Authors:
  • Doug Beeferman, Adam Berger, John Lafferty

  • Affiliation:
  • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

  • Venue:
  • Machine Learning - Special issue on natural language learning
  • Year:
  • 1999


Abstract

This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.