A stochastic finite-state word-segmentation algorithm for Chinese

  • Authors:
  • Richard Sproat; William Gale; Chilin Shih; Nancy Chang

  • Affiliations:
  • Bell Laboratories; Bell Laboratories; Bell Laboratories; University of Cambridge

  • Venue:
  • Computational Linguistics
  • Year:
  • 1996

Abstract

The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model in which the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes, and, since the primary intended application of this model is to text-to-speech synthesis, provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with those of a pool of human segmenters, and the system is shown to perform quite well.
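
The paper's segmenter is implemented with weighted finite-state transducers; as a rough illustration of the underlying idea only, the Python sketch below picks the segmentation of an input string that minimizes total negative log word probability, which corresponds to finding the cheapest path through the lattice a weighted dictionary transducer would produce. The toy lexicon, its frequencies, and the four-character word-length cap are invented for this example and are not taken from the paper.

```python
import math

# Hypothetical toy lexicon mapping words to corpus frequencies. The
# authors' system uses a large Chinese dictionary compiled into a
# weighted finite-state transducer; these entries are invented here
# purely for illustration.
LEXICON = {
    "日": 10, "文": 5, "日文": 40, "文章": 60,
    "章": 8, "魚": 30, "章魚": 50,
}
TOTAL = sum(LEXICON.values())

def word_cost(word: str) -> float:
    # Cost of a word = negative log of its estimated probability,
    # mirroring the arc weights a WFST-based segmenter would carry.
    return -math.log(LEXICON[word] / TOTAL)

def segment(text: str) -> list[str]:
    """Return the least-cost segmentation of `text` into lexicon words.

    The dynamic program below is equivalent to finding the cheapest
    path through the lattice obtained by composing the input string
    with a weighted dictionary transducer.
    """
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: min cost to segment text[:i]
    back = [None] * (n + 1)      # back[i]: start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        # Assume lexicon words are at most 4 characters long.
        for j in range(max(0, i - 4), i):
            word = text[j:i]
            if word in LEXICON and best[j] + word_cost(word) < best[i]:
                best[i] = best[j] + word_cost(word)
                back[i] = j
    if n > 0 and back[n] is None:
        raise ValueError("input not covered by the toy lexicon")
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

if __name__ == "__main__":
    # "日文章魚" is ambiguous between 日文/章魚 ("Japanese" + "octopus")
    # and 日/文章/魚; the frequency weights favor the former.
    print(segment("日文章魚"))  # -> ['日文', '章魚']
```

A full system would go beyond a static dictionary, handling words produced by the productive lexical processes the abstract mentions (for example, derived words and names) and attaching pronunciations for text-to-speech synthesis.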