Morphology induction from limited noisy data using approximate string matching

  • Authors: Burcu Karagol-Ayan; David Doermann; Amy Weinberg
  • Affiliations: University of Maryland, College Park, MD (all three authors)
  • Venue: SIGPHON '06 Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology
  • Year: 2006

Abstract

For a language with limited resources, a dictionary may be one of the few available electronic resources. To make effective use of the dictionary for translation, however, users must be able to access it using the root form of a morphologically deformed variant found in the text. Stemming and data-driven methods are not suitable when data is sparse. We present algorithms for discovering morphemes from limited, noisy data obtained by scanning a hard-copy dictionary. Our approach is based on a novel application of the longest common substring and string edit distance metrics. Results show that these algorithms can in fact segment words into roots and affixes, and extract affixes, from the limited data contained in a dictionary. This in turn allows non-native speakers to perform multilingual tasks in applications where response must be rapid and their knowledge of the language is limited. In addition, this analysis can feed other NLP tools requiring lexicons.
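The abstract names two string metrics, longest common substring and edit distance, without giving the algorithms themselves. The following is a minimal sketch of those two textbook metrics, plus a naive way one might use the shared substring to split a word into a candidate prefix, root, and suffix. It is not the authors' method: the function names (`longest_common_substring`, `edit_distance`, `split_on_root`) and the splitting rule are assumptions made purely for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    On ties, the match that ends earliest in `a` is returned.
    """
    best_len = 0
    best_end = 0  # exclusive end index of the best match in `a`
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete from a
                          curr[j - 1] + 1,     # insert into a
                          prev[j - 1] + cost)  # substitute or match
        prev = curr
    return prev[len(b)]


def split_on_root(word: str, root: str) -> tuple:
    """Hypothetical helper: split `word` around its longest substring
    shared with `root`, giving (prefix, shared, suffix)."""
    shared = longest_common_substring(word, root)
    start = word.find(shared)
    return (word[:start], shared, word[start + len(shared):])


# Illustrative English examples (assumed; not data from the paper).
print(split_on_root("unhappiness", "happy"))  # ('un', 'happ', 'iness')
print(edit_distance("wa1king", "walking"))    # 1, a single OCR-style error
```

The edit distance tolerates the kind of single-character noise that scanning can introduce, while the shared substring gives a rough root candidate; how the paper actually combines the two is not stated in the abstract.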