An unsupervised Hindi stemmer with heuristic improvements

  • Authors: Amaresh Kumar Pandey; Tanveer J Siddiqui
  • Affiliations: Indian Institute of Information Technology - Allahabad, Allahabad, India (both authors)
  • Venue: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data
  • Year: 2008

Abstract

Stemmers convert inflected words into their root or stem; the stem does not necessarily correspond to the linguistic root of a word. Stemming improves performance by reducing morphological variants to the same word. This paper presents an approach to developing an unsupervised stemmer for Hindi and evaluates it using manually segmented words. We evaluate our approach on 1000-1000 words randomly extracted from the Hindi WordNet database. The training data consists of 106403 words extracted from the EMILLE corpus. The observed accuracy was 89.9% after applying some heuristic measures, and the F-score was 94.96%. As the algorithm does not require any language-specific information, it can be applied to other Indian languages as well. We also evaluate the effect of the stemmer on reducing index size for a Hindi information retrieval task. The results are compared with the lightweight stemmer [10] and the UMass stemmer [17]. Test runs show that our stemmer outperforms both.
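The two kinds of evaluation mentioned in the abstract (accuracy against manually segmented words, and reduction in index size) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_stem`, `TOY_SUFFIXES`, and the romanized word list `GOLD` are hypothetical and are not the stemmer, heuristics, or data described in the paper. Accuracy is taken here as the fraction of words whose predicted stem matches the manual stem, and index reduction as the relative drop in the number of distinct index terms.

```python
from typing import Callable, Dict, Iterable

# Hypothetical suffix list for romanized toy data; NOT the paper's stemmer.
TOY_SUFFIXES = ("on", "en", "e", "a")

# Hypothetical gold standard: word -> manually assigned stem (romanized).
GOLD: Dict[str, str] = {
    "kitab": "kitab", "kitaben": "kitab", "kitabon": "kitab",
    "ladka": "ladk", "ladke": "ladk", "ladkon": "ladk",
    "ghar": "ghar", "gharon": "ghar",
    "raja": "raja",  # the toy stemmer over-stems this word to "raj"
}


def toy_stem(word: str) -> str:
    """Strip the longest matching toy suffix, keeping at least two characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def stem_accuracy(stem: Callable[[str], str], gold: Dict[str, str]) -> float:
    """Fraction of words whose predicted stem equals the manual stem."""
    correct = sum(1 for word, expected in gold.items() if stem(word) == expected)
    return correct / len(gold)


def index_reduction(stem: Callable[[str], str], vocabulary: Iterable[str]) -> float:
    """Relative reduction in distinct index terms after stemming."""
    vocab = set(vocabulary)
    stemmed = {stem(word) for word in vocab}
    return 1.0 - len(stemmed) / len(vocab)


if __name__ == "__main__":
    print(f"accuracy: {stem_accuracy(toy_stem, GOLD):.3f}")
    print(f"index reduction: {index_reduction(toy_stem, GOLD):.3f}")
```

Because both functions take the stemmer as a parameter, the same harness could in principle be used to compare several stemmers on one word list, which is the kind of comparison the abstract reports.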