An improved stemming approach using HMM for a highly inflectional language

Authors:
Navanath Saharia;Kishori M. Konwar;Utpal Sharma;Jugal K. Kalita
Affiliations:
Department of CSE, Tezpur University, India;Department of MI, University of British Columbia, Canada;Department of CSE, Tezpur University, India;Department of CS, University of Colorado at Colorado Springs
Venue:
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Year:
2013

Abstract

Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization for automatic document processing. High-quality stemming is difficult in highly inflectional Indic languages, and little research has addressed the design of stemming algorithms for them. In this study, we focus on stemming texts in Assamese, a low-resource Indic language spoken in the north-eastern part of India by approximately 30 million people. Stemming is hard in Assamese because morphological inflections commonly appear as single-letter suffixes: more than 50% of the inflections in Assamese take this form. Such single-letter inflections cause ambiguity when predicting the underlying root word. We therefore propose a new method that combines a rule-based algorithm for predicting multiple-letter suffixes with an HMM-based algorithm for predicting single-letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.
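
The abstract does not give the details of either component, so the following Python sketch is only one plausible reading of the combined idea, not the authors' implementation. It assumes a two-state HMM whose hidden state says whether a word's final letter belongs to the root or is a single-letter suffix, with the final letter as the observation. The suffix list, the probability tables, and the Latin-script tokens are all hypothetical placeholders and are not taken from the paper's data.

```python
import math

# Hypothetical multi-letter suffixes (placeholders, not real Assamese data).
MULTI_SUFFIXES = ["bor", "khon", "ok"]

def strip_multi_letter(word, suffixes=MULTI_SUFFIXES, min_stem=2):
    """Rule-based step: remove the longest matching multi-letter suffix."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

# Assumed hidden states: is the final letter part of the root, or a suffix?
STATES = ("ROOT", "SUFFIX")

# Toy HMM parameters, invented for illustration only.
START_P = {"ROOT": 0.5, "SUFFIX": 0.5}
TRANS_P = {"ROOT": {"ROOT": 0.6, "SUFFIX": 0.4},
           "SUFFIX": {"ROOT": 0.7, "SUFFIX": 0.3}}
EMIT_P = {"ROOT": {"e": 0.2, "r": 0.3, "n": 0.5},
          "SUFFIX": {"e": 0.7, "r": 0.25, "n": 0.05}}

def viterbi(observations, start_p, trans_p, emit_p, floor=1e-6):
    """Standard Viterbi decoding in log space; unseen letters get a small floor."""
    def lp(p):
        return math.log(max(p, floor))

    best = [{s: lp(start_p[s]) + lp(emit_p[s].get(observations[0], 0))
             for s in STATES}]
    back = []
    for obs in observations[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda q: best[-1][q] + lp(trans_p[q][s]))
            scores[s] = (best[-1][prev] + lp(trans_p[prev][s])
                         + lp(emit_p[s].get(obs, 0)))
            ptr[s] = prev
        best.append(scores)
        back.append(ptr)

    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def stem_sentence(words):
    """Apply the rule-based step first; fall back to the HMM decision otherwise."""
    if not words:
        return []
    tags = viterbi([w[-1] for w in words], START_P, TRANS_P, EMIT_P)
    stems = []
    for word, tag in zip(words, tags):
        stem = strip_multi_letter(word)
        if stem == word and tag == "SUFFIX" and len(word) > 2:
            stem = word[:-1]  # drop the single-letter suffix chosen by the HMM
        stems.append(stem)
    return stems

print(stem_sentence(["manuhe", "kitapbor", "pan"]))  # ['manuh', 'kitap', 'pan']
```

In this toy run the rule-based step removes the multi-letter suffix from the second token, while the HMM decides that the final letter of the first token is a suffix and that the final letter of the third belongs to the root. In a real system the transition and emission probabilities would have to be estimated from annotated text rather than set by hand.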