An unsupervised Hindi stemmer with heuristic improvements

Authors:
Amaresh Kumar Pandey;Tanveer J Siddiqui
Affiliations:
Indian Institute of Information Technology - Allahabad, Allahabad, India;Indian Institute of Information Technology - Allahabad, Allahabad, India
Venue:
Proceedings of the second workshop on Analytics for noisy unstructured text data
Year:
2008

Citing 9
Cited 4

Information retrieval: data structures and algorithms

Information retrieval: data structures and algorithms
Unsupervised learning of the morphology of a natural language

Computational Linguistics
Hindi CLIR in thirty days

ACM Transactions on Asian Language Information Processing (TALIP)
A Bayesian model for morpheme and paradigm identification

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Knowledge-free induction of inflectional morphologies

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Unsupervised segmentation of words using prior distributions of morph length and frequency

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
YASS: Yet another suffix stripper

ACM Transactions on Information Systems (TOIS)
Multilingual noise-robust supervised morphological analysis using the WordFrame model

SIGMorPhon '04 Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology
Morphology induction from term clusters

CONLL '05 Proceedings of the Ninth Conference on Computational Natural Language Learning

Large-coverage root lexicon extraction for Hindi

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Assas-Band, an affix-exception-list based Urdu stemmer

ALR7 Proceedings of the 7th Workshop on Asian Language Resources
Analysis and evaluation of stemming algorithms: a case study with Assamese

Proceedings of the International Conference on Advances in Computing, Communications and Informatics
An improved stemming approach using HMM for a highly inflectional language

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I

Abstract

Stemmers convert inflected words into their root or stem. A stem does not necessarily correspond to the linguistic root of a word. Stemming improves performance by reducing morphological variants to the same word. This paper presents an unsupervised stemmer for Hindi and evaluates the approach using manually segmented words. We evaluate our approach on samples of 1000 words each, randomly extracted from the Hindi WordNet database. The training data consists of 106403 words extracted from the EMILLE corpus. The observed accuracy was 89.9% after applying some heuristic measures, and the F-score was 94.96%. Because the algorithm does not require any language-specific information, it can be applied to other Indian languages as well. We also evaluate the effect of the stemmer on the size of the index for a Hindi information retrieval task. The results are compared with a lightweight stemmer [10] and the UMass stemmer [17]. Test runs show that our stemmer outperforms both stemmers.
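The two kinds of evaluation mentioned in the abstract (agreement with manually segmented words, and reduction in index size) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_stem` is a hypothetical hand-written suffix stripper, not the unsupervised method of the paper; the romanized word list is invented toy data; and exact-match accuracy is only one possible definition, since the abstract does not say how accuracy or F-score were computed.

```python
def toy_stem(word, suffixes=("on", "a", "e", "i")):
    """Strip the longest matching suffix (hypothetical stand-in for a learned stemmer)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Keep at least two characters of stem so short words are not destroyed.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def accuracy(stemmer, gold):
    """Fraction of words whose predicted stem equals the manually segmented stem."""
    correct = sum(1 for word, stem in gold.items() if stemmer(word) == stem)
    return correct / len(gold)


def index_reduction(stemmer, vocabulary):
    """Relative drop in the number of distinct index terms after stemming."""
    before = len(set(vocabulary))
    after = len({stemmer(word) for word in vocabulary})
    return (before - after) / before


# Invented, romanized toy data: word -> manually assigned stem.
gold = {"ladka": "ladk", "ladke": "ladk", "ladkon": "ladk", "ghar": "ghar"}

print(accuracy(toy_stem, gold))               # 1.0 on this toy data
print(index_reduction(toy_stem, list(gold)))  # 0.5: four index terms collapse to two
```

On real data the gold stems would come from the manually segmented test words and the vocabulary from the indexed collection; the toy numbers above say nothing about the results reported in the paper.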