A formalism for universal segmentation of text

  • Authors: Julien Quint
  • Affiliations: Xerox Research Centre Europe, Meylan, France
  • Venue: COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
  • Year: 2000

Abstract

Sumo is a formalism for universal segmentation of text. Its purpose is to provide a framework for the creation of segmentation applications. It is called "universal" because the formalism itself is independent of both the language of the documents to be processed and the levels of segmentation (e.g. words, sentences, paragraphs, morphemes) considered by the target application. The framework relies on a layered structure that represents the possible segmentations of the document. This structure and the tools to manipulate it are described, followed by detailed examples highlighting some features of Sumo.
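
To make the idea of a layered structure holding several possible segmentations more concrete, here is a minimal Python sketch. It is not Sumo's notation or implementation: the `Item` class, the `candidate_items` and `segmentations` functions, and the toy lexicon are all hypothetical names and data chosen for illustration. The sketch assumes a lower layer of single characters and an upper layer of candidate items, each covering a contiguous span of the lower layer; a segmentation is then any chain of adjacent upper-layer items that covers the whole lower layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical upper-layer item covering lower-layer positions [start, end)."""
    start: int
    end: int
    label: str


def candidate_items(chars, lexicon):
    """Build the upper layer: every lexicon entry found at any position of the lower layer."""
    items = []
    for start in range(len(chars)):
        for end in range(start + 1, len(chars) + 1):
            piece = "".join(chars[start:end])
            if piece in lexicon:
                items.append(Item(start, end, piece))
    return items


def segmentations(items, length):
    """Enumerate every chain of adjacent items that covers positions 0..length."""
    by_start = {}
    for item in items:
        by_start.setdefault(item.start, []).append(item)

    def walk(pos):
        if pos == length:
            yield []  # reached the end of the lower layer: one complete segmentation
            return
        for item in by_start.get(pos, []):
            for rest in walk(item.end):
                yield [item.label] + rest

    return list(walk(0))


# Lower layer: one item per character (an assumption made for this sketch).
lower_layer = list("abcd")
# Toy lexicon, invented for illustration only.
toy_lexicon = {"a", "ab", "b", "cd", "bcd"}

upper_layer = candidate_items(lower_layer, toy_lexicon)
for seg in segmentations(upper_layer, len(lower_layer)):
    print(" | ".join(seg))
```

With this toy input the upper layer admits three competing segmentations of the same character layer (`a | b | cd`, `a | bcd`, and `ab | cd`), which is the kind of ambiguity a layered representation is meant to keep available rather than resolve prematurely. Nothing in the sketch depends on a particular language or on the layers being characters and words; the same shape would apply to, say, words and sentences.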