Morphology induction from limited noisy data using approximate string matching

Authors:
Burcu Karagol-Ayan;David Doermann;Amy Weinberg
Affiliations:
University of Maryland, College Park, MD;University of Maryland, College Park, MD;University of Maryland, College Park, MD
Venue:
SIGPHON '06 Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology
Year:
2006

Abstract

For a language with limited resources, a dictionary may be one of the few available electronic resources. To use the dictionary effectively for translation, however, users must be able to look up entries by the root form of the morphologically deformed variants found in the text. Stemming and data-driven methods are not suitable when data is sparse. We present algorithms for discovering morphemes from limited, noisy data obtained by scanning a hard-copy dictionary. Our approach is based on a novel application of the longest common substring and string edit distance metrics. Results show that these algorithms can segment words into roots and affixes, and extract the affixes, using only the limited data contained in a dictionary. This in turn allows non-native speakers to perform multilingual tasks in applications where the response must be rapid and their knowledge is limited. In addition, the analysis can feed other NLP tools that require lexicons.
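The abstract names two string metrics, longest common substring and string edit distance, without showing them. The sketch below illustrates both in Python using the standard textbook dynamic-programming formulations. It is not the authors' algorithm: the function names, the `split_affixes` helper, and the English toy words are all assumptions introduced here for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    On ties, the substring that ends earliest in `a` is returned.
    """
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in `a`
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def split_affixes(word: str, root: str):
    """Hypothetical helper: split word into (prefix, root, suffix).

    Returns None when root does not occur in word.
    """
    start = word.find(root)
    if start < 0:
        return None
    return word[:start], root, word[start + len(root):]


if __name__ == "__main__":
    # Toy English forms, chosen only for illustration.
    root = longest_common_substring("walking", "walked")
    print(root)
    print(split_affixes("walking", root))
    # A one-character scanning error still leaves the forms close.
    print(edit_distance("walkirg", "walking"))
```

In this toy setting, the shared substring of two related forms serves as a candidate root and the remainder as a candidate affix, while a small edit distance lets a noisy scanned form be matched to its intended spelling. How the paper actually combines the two metrics is not stated in the abstract.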