A grammatico-statistical approach to discourse partitioning

  • Authors: Tadashi Nomoto; Yoshihiko Nitta
  • Affiliations: Advanced Research Laboratory, Hitachi Ltd. (both authors)
  • Venue: COLING '94: Proceedings of the 15th Conference on Computational Linguistics, Volume 2
  • Year: 1994

Abstract

This paper presents a new approach to text segmentation, the task of dividing a text into coherent discourse units. The approach builds on the theory of discourse segments (Nomoto and Nitta, 1993) and incorporates ideas from information retrieval research (Salton, 1988). A discourse segment is a structural unit of Japanese discourse: it can be thought of as a linguistic unit demarcated by wa, the Japanese topic particle, and it may extend over several sentences. The segmentation operates on discourse segments and uses a coherence measure based on tf-idf, a standard term-weighting measure in information retrieval (Salton, 1988; Hearst, 1993). Experiments were carried out on a Japanese newspaper corpus. The approach proved quite successful at recovering the original articles from the unstructured corpus.
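
To make the idea of a tf-idf coherence measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the text has already been split into discourse segments and tokenized, that coherence between adjacent segments is measured by the cosine of their tf-idf vectors, and that a boundary is placed wherever that value falls below a fixed threshold. The function names, the idf formula, and the threshold value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(segments):
    """Turn each segment (a list of tokens) into a {term: tf-idf weight} dict.

    Assumption: idf is log(N / df), with N the number of segments and df the
    number of segments containing the term. Other variants are possible.
    """
    n = len(segments)
    df = Counter()
    for seg in segments:
        df.update(set(seg))  # count each term once per segment
    vectors = []
    for seg in segments:
        tf = Counter(seg)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_boundaries(segments, threshold=0.1):
    """Return indices i such that a boundary falls just before segment i.

    Assumption: a boundary is hypothesized wherever the coherence between
    two adjacent segments drops below `threshold` (an illustrative value).
    """
    vecs = tfidf_vectors(segments)
    return [
        i + 1
        for i in range(len(vecs) - 1)
        if cosine(vecs[i], vecs[i + 1]) < threshold
    ]


# Toy input: four pre-tokenized "discourse segments" from two topics.
toy_segments = [
    ["bank", "rate", "loan"],
    ["bank", "loan", "interest"],
    ["soccer", "goal", "team"],
    ["soccer", "team", "coach"],
]
boundaries = find_boundaries(toy_segments)
```

On this toy input the second and third segments share no terms, so their coherence is zero and the sketch places a single boundary before the third segment. Identifying real discourse segments in Japanese text, including detection of the particle wa, is outside the scope of this sketch.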