Effective and Robust Query-Based Stemming

  • Authors:
  • Jiaul H. Paik; Swapan K. Parui; Dipasree Pal; Stephen E. Robertson

  • Affiliations:
  • Indian Statistical Institute, Kolkata; Indian Statistical Institute, Kolkata; Indian Statistical Institute, Kolkata; Microsoft Research, Cambridge, UK

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2013

Abstract

Stemming is widely used in information retrieval systems to address the vocabulary mismatch problem arising from morphological phenomena. The major shortcoming of commonly used stemmers is that they accept the morphological variants of query words without considering their thematic coherence with the given query, which leads to poor performance. Moreover, for many queries such approaches yield retrieval performance poorer than that obtained with no stemming, thereby degrading robustness. The main goal of this article is to present corpus-based, fully automatic stemming algorithms that address these issues. A set of experiments on six TREC collections and three other non-English collections containing news and web documents shows that the proposed query-based stemming algorithms consistently and significantly outperform four strong state-of-the-art stemmers based on widely differing principles. Our experiments also confirm that the robustness of the proposed query-based stemming algorithms is remarkably better than that of the existing strong baselines.