Unsupervised segmentation of words using prior distributions of morph length and frequency

  • Authors: Mathias Creutz
  • Affiliations: Helsinki University of Technology, HUT, Finland
  • Venue: ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
  • Year: 2003

Abstract

We present a language-independent, unsupervised algorithm for segmenting words into morphs. The algorithm is based on a new generative probabilistic model that uses prior information on the length and frequency distributions of morphs in a language. It outperforms two competing algorithms when evaluated on data from Finnish, a language with agglutinative morphology, and also performs well on English data.
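
The abstract does not spell out the model, so the following is only a toy sketch of the general idea, not the paper's algorithm. It assumes a gamma-shaped prior on morph length and a smoothed relative-frequency term for each morph, and it picks the best split by brute force. All function names, parameter values, and example counts are hypothetical.

```python
import math


def gamma_log_pdf(x, shape, scale):
    """Log density of a gamma distribution at x > 0."""
    return ((shape - 1) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


def segmentation_log_score(morphs, morph_counts, total_count,
                           shape=5.0, scale=1.0):
    """Toy score: smoothed log frequency plus a length prior per morph.

    The add-one smoothing and the default shape/scale values are
    illustrative assumptions, not values taken from the paper.
    """
    denominator = total_count + len(morph_counts) + 1
    score = 0.0
    for morph in morphs:
        count = morph_counts.get(morph, 0) + 1  # add-one smoothing
        score += math.log(count / denominator)
        score += gamma_log_pdf(len(morph), shape, scale)
    return score


def all_segmentations(word):
    """Yield every split of word into contiguous, non-empty pieces."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in all_segmentations(word[i:]):
            yield [word[:i]] + rest


def best_segmentation(word, morph_counts, **prior_params):
    """Return the highest-scoring split (exhaustive search, short words only)."""
    total = sum(morph_counts.values())
    return max(
        all_segmentations(word),
        key=lambda seg: segmentation_log_score(
            seg, morph_counts, total, **prior_params),
    )


if __name__ == "__main__":
    # Hypothetical morph counts, made up for illustration.
    counts = {"talo": 50, "ssa": 40, "auto": 30}
    print(best_segmentation("talossa", counts))
```

The exhaustive search enumerates 2^(n-1) splits for a word of n characters, so it is usable only for short words; a practical system would need a more efficient search strategy.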