A Fast Corpus-Based Stemmer

  • Authors:
  • Jiaul H. Paik;Swapan K. Parui

  • Affiliations:
  • Indian Statistical Institute;Indian Statistical Institute

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Stemming is a mechanism of word form normalization that transforms the variant word forms to their common root. In an Information Retrieval system, it is used to increase the system’s performance, specifically the recall and desirably the precision. Although its usefulness is shown to be mixed in languages such as English, because morphologically complex languages stemming produces a significant performance improvement. A number of linguistic rule-based stemmers are available for most European languages which employ a set of rules to get back the root word from its variants. But for Indian languages which are highly inflectional in nature, devising a linguistic rule-based stemmer needs some additional resources which are not available. We present an approach which is purely corpus based and finds the equivalence classes of variant words in an unsupervised manner. A set of experiments on four languages using FIRE, CLEF, and TREC test collections shows that our approach provides comparable results with linguistic rule-based stemmers for some languages and gives significant performance improvement for resource constrained languages such as Bengali and Marathi.