Acquisition of Morphology of an Indic Language from Text Corpus

  • Authors: Utpal Sharma; Jugal K. Kalita; Rajib K. Das
  • Affiliations: Tezpur University; University of Colorado; Calcutta University
  • Venue: ACM Transactions on Asian Language Information Processing (TALIP)
  • Year: 2008

Abstract

This article describes an approach to unsupervised learning of morphology from an unannotated corpus for a highly inflectional Indo-European language called Assamese, spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. There exists no prior computational work on this language, which is spoken widely in northeast India. The work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology, where the presence of a large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences make approximately 50% of the words used in written and spoken text inflected. We implement methods proposed by Gaussier and Goldsmith on acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method more suitable for handling suffix sequences, enabling us to increase the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the morphological knowledge acquired and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text.
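
The abstract reports its results as precision, recall, and F-measure without defining them. Assuming the standard balanced F-measure (the harmonic mean of precision and recall), scoring a set of predicted stem and suffix splits against a reference set could be sketched roughly as follows. The function name `precision_recall_f` and the English toy words are hypothetical illustrations, not the article's evaluation code or data.

```python
def precision_recall_f(predicted, gold):
    """Score predicted (word, stem, suffix) analyses against a gold set.

    Assumption: balanced F-measure, F = 2PR / (P + R). The article may
    define or weight its measure differently.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # analyses that match exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data (hypothetical, English for readability): one wrong split out of four.
gold = {
    ("walked", "walk", "ed"),
    ("walking", "walk", "ing"),
    ("talks", "talk", "s"),
    ("cats", "cat", "s"),
}
predicted = {
    ("walked", "walk", "ed"),
    ("walking", "walk", "ing"),
    ("talks", "tal", "ks"),  # incorrect stem/suffix boundary
    ("cats", "cat", "s"),
}

p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Under this definition, when precision and recall are equal the F-measure takes the same value, so the roughly 85% precision and recall reported for small chunks of text would correspond to a balanced F-measure of about 85%.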