Acquisition of Morphology of an Indic Language from Text Corpus

Authors:
Utpal Sharma;Jugal K. Kalita;Rajib K. Das
Affiliations:
Tezpur University;University of Colorado;Calcutta University
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2008

Citing 8

Memory-Based Lexical Acquisition and Processing
Proceedings of the Third International EAMT Workshop on Machine Translation and the Lexicon

Unsupervised learning of the morphology of a natural language
Computational Linguistics

Acquiring receptive morphology: a connectionist model
ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics

On the Statistical Properties of the F-measure
QSIC '04 Proceedings of the Quality Software, Fourth International Conference

Unsupervised segmentation of words using prior distributions of morph length and frequency
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1

Unsupervised learning of morphology for building lexicon for a highly inflectional language
MPL '02 Proceedings of the ACL-02 workshop on Morphological and phonological learning - Volume 6

Unsupervised learning of morphology using a novel directed search algorithm: taking the first step
MPL '02 Proceedings of the ACL-02 workshop on Morphological and phonological learning - Volume 6

Induction of a simple morphology for highly-inflecting languages
SIGMorPhon '04 Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology

Cited 3

Part of speech tagger for Assamese text
ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Analysis and evaluation of stemming algorithms: a case study with Assamese
Proceedings of the International Conference on Advances in Computing, Communications and Informatics

An improved stemming approach using HMM for a highly inflectional language
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I

Abstract

This article describes an approach to unsupervised learning of morphology from an unannotated corpus for a highly inflectional Indo-European language called Assamese, spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. There exists no prior computational work on this language spoken widely in northeast India. The work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology where the presence of a large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences make approximately 50% of the words used in written and spoken text inflected. We implement methods proposed by Gaussier and Goldsmith on acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method more suitable for handling suffix sequences, enabling us to increase the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the morphological knowledge acquired and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text.
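
The abstract reports its results as precision, recall, and F-measure. As a rough illustration of how such scores are commonly computed, the sketch below compares a set of predicted (word, suffix) analyses against a gold set. It assumes the standard balanced F-measure (the harmonic mean of precision and recall); the abstract does not state the exact evaluation protocol, and the function name `precision_recall_f` and the placeholder data are hypothetical, not taken from the article.

```python
def precision_recall_f(predicted, gold):
    """Score predicted analyses against gold analyses.

    Both arguments are iterables of hashable items, here (word, suffix) pairs.
    Assumption: balanced F-measure, i.e. the harmonic mean of precision and recall.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical placeholder data: "w1".."w5" stand for words, "-a".."-c" for suffixes.
gold = {("w1", "-a"), ("w2", "-b"), ("w3", "-a"), ("w4", "-c")}
predicted = {("w1", "-a"), ("w2", "-b"), ("w3", "-a"), ("w4", "-b"), ("w5", "-a")}

p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Scoring over sets of pairs means that a word analysed with the wrong suffix lowers both precision and recall, which is one plausible way to penalise mis-segmented suffix sequences.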