An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

  • Authors:
  • Michael R. Brent

  • Affiliations:
  • Department of Cognitive Science, Johns Hopkins University, Baltimore, MD 21218. brent@jhu.edu

  • Venue:
  • Machine Learning - Special issue on natural language learning
  • Year:
  • 1999


Abstract

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
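The core idea of scoring candidate word sequences and selecting the most probable one can be sketched with a dynamic program over possible segmentations. The sketch below is a deliberately simplified illustration, not Brent's actual model: it assumes a small, fixed lexicon with made-up word probabilities (Brent's algorithm instead assigns a prior to the lexicon itself and discovers it from the unsegmented text).

```python
import math

# Toy lexicon with invented probabilities (illustrative assumption only;
# the paper's model does not take a known lexicon as input).
LEXICON = {"the": 0.3, "dog": 0.2, "ate": 0.1, "thedo": 0.01, "gate": 0.05}

def segment(text):
    """Return the segmentation of `text` into lexicon words that
    maximizes the product of per-word probabilities, via dynamic
    programming over prefix lengths."""
    n = len(text)
    # best[i] = (log-probability, word list) for the best parse of text[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            if word in LEXICON and best[j][0] > -math.inf:
                score = best[j][0] + math.log(LEXICON[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]

print(segment("thedogate"))  # -> ['the', 'dog', 'ate']
```

Here "thedogate" has two parses under the toy lexicon, and the dynamic program prefers the higher-probability sequence ['the', 'dog', 'ate'] over ['thedo', 'gate']; Brent's algorithm performs an analogous search, but over segmentations scored by the prior probability of the entire corpus.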