Efficient unsupervised recursive word segmentation using minimum description length

  • Authors:
  • Shlomo Argamon; Navot Akiva; Amihood Amir; Oren Kapah

  • Affiliations:
  • Illinois Institute of Technology, Chicago, IL; Bar-Ilan University, Israel; Bar-Ilan University, Israel; Bar-Ilan University, Israel

  • Venue:
  • COLING '04 Proceedings of the 20th international conference on Computational Linguistics
  • Year:
  • 2004

Abstract

Automatic word segmentation is a basic requirement for unsupervised learning in morphological analysis. In this paper, we formulate a novel recursive method for minimum description length (MDL) word segmentation, whose basic operation is resegmenting the corpus on a prefix (equivalently, a suffix). We derive a local expression for the change in description length under resegmentation, i.e., one which depends only on properties of the specific prefix (not on the rest of the corpus). Such a formulation permits use of a new and efficient algorithm for greedy morphological segmentation of the corpus in a recursive manner. In particular, our method does not restrict words to be segmented only once, into a stem+affix form, as do many extant techniques. Early results for English and Turkish corpora are promising.
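
To make the basic operation concrete, here is a minimal sketch of greedy, recursive resegmentation on prefixes. It is not the authors' algorithm. The abstract does not give their local expression for the change in description length, so this sketch recomputes a simple two-part description length over the whole corpus after each trial resegmentation, which is exactly the cost the paper's local formulation is meant to avoid. The cost model (a flat `BITS_PER_CHAR` for spelling lexicon entries plus a unigram code for the corpus) and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Assumption: a flat cost, in bits, for spelling one character of a lexicon entry.
BITS_PER_CHAR = 5.0


def description_length(tokens):
    """Two-part code: lexicon spelling cost plus corpus cost under a unigram model."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Each distinct token is spelled once, with one extra symbol as terminator.
    lexicon_bits = sum((len(t) + 1) * BITS_PER_CHAR for t in counts)
    # Each occurrence costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


def resegment_on_prefix(tokens, prefix):
    """Split every token that properly starts with `prefix` into prefix + remainder."""
    out = []
    for t in tokens:
        if t.startswith(prefix) and len(t) > len(prefix):
            out.extend([prefix, t[len(prefix):]])
        else:
            out.append(t)
    return out


def candidate_prefixes(tokens, min_len=2):
    """All proper prefixes of the current token types with at least `min_len` characters."""
    cands = set()
    for t in set(tokens):
        for i in range(min_len, len(t)):
            cands.add(t[:i])
    return cands


def greedy_segment(tokens, max_rounds=50):
    """Repeatedly apply the single prefix resegmentation that most reduces the
    description length. Remainders become ordinary tokens, so later rounds can
    split them again; a word is not limited to one stem+affix cut."""
    current = list(tokens)
    current_dl = description_length(current)
    for _ in range(max_rounds):
        best, best_dl = None, current_dl
        for p in sorted(candidate_prefixes(current)):
            trial = resegment_on_prefix(current, p)
            dl = description_length(trial)
            if dl < best_dl - 1e-9:
                best, best_dl = trial, dl
        if best is None:  # no prefix shortens the description any further
            break
        current, current_dl = best, best_dl
    return current, current_dl


if __name__ == "__main__":
    corpus = "walking walked walks talking talked talks jumping jumped jumps".split()
    segmented, dl = greedy_segment(corpus)
    print(description_length(corpus), dl)
    print(segmented)
```

Every trial in this sketch costs time proportional to the corpus size. A local expression that depends only on properties of the prefix, as the abstract describes, would let each candidate be scored without rescanning the corpus.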