Universal Segmentation of Text with the Sumo Formalism

Authors:
Julien Quint
Venue:
NLP '00 Proceedings of the Second International Conference on Natural Language Processing
Year:
2000

Abstract

We propose Sumo, a universal formalism for the segmentation of text documents. Its main purpose is to help create segmentation systems for documents in any language. Because the processing is independent of the language, any level of segmentation (character, word, sentence, paragraph, etc.) can be considered. We argue for the usefulness of such a formalism, describe the segmentation framework on which Sumo relies, and give detailed examples that demonstrate some of its features.