Distribution based stemmer refinement

  • Authors:
  • B. L. Narayan;Sankar K. Pal

  • Affiliations:
  • Machine Intelligence Unit, Indian Statistical Institute, Calcutta, India;Machine Intelligence Unit, Indian Statistical Institute, Calcutta, India

  • Venue:
  • PReMI'05 Proceedings of the First International Conference on Pattern Recognition and Machine Intelligence
  • Year:
  • 2005

Abstract

Stemming is a common preprocessing task applied to text corpora. Errors made by a stemmer may be corrected either manually or with the help of a corpus. We describe a novel corpus-based stemming technique that models each word as being generated from a multinomial distribution over the topics available in the corpus. A procedure resembling sequential hypothesis testing groups together distributionally similar words. The resulting stemmer refines any given stemmer, and its strength can be controlled with two thresholds. A refinement based on the 20 Newsgroups data set shows that the proposed method splits equivalence classes appropriately.
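
The abstract does not spell out the test itself, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes each word is summarized by a vector of per-topic occurrence counts, and it swaps in a standard likelihood-ratio (G) statistic to compare two words that a base stemmer placed in the same equivalence class. The function names, the example counts, and the threshold values `lower` and `upper` are all hypothetical; the three-way outcome merely mimics how a sequential test with two thresholds can accept, reject, or defer.

```python
import math

def mle(counts):
    """Maximum-likelihood multinomial parameters for a count vector.
    Assumes the vector has at least one nonzero count."""
    total = sum(counts)
    return [c / total for c in counts]

def log_likelihood(counts, probs):
    """Multinomial log-likelihood, dropping the constant combinatorial term.
    Zero counts contribute nothing, so log(0) is never evaluated."""
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def g_statistic(a, b):
    """2 * (log L with separate distributions - log L with one shared one).
    Near zero when the two words are spread over topics in the same way."""
    pooled = mle([x + y for x, y in zip(a, b)])
    separate = log_likelihood(a, mle(a)) + log_likelihood(b, mle(b))
    shared = log_likelihood(a, pooled) + log_likelihood(b, pooled)
    return 2.0 * (separate - shared)

def decide(a, b, lower, upper):
    """Hypothetical two-threshold rule (an assumption, not from the paper):
    keep the words together, split them, or defer to the base stemmer."""
    g = g_statistic(a, b)
    if g <= lower:
        return "merge"
    if g >= upper:
        return "split"
    return "undecided"

# Made-up topic counts for two words sharing a stem under the base stemmer.
print(decide([10, 0, 5], [10, 0, 5], lower=1.0, upper=10.0))
print(decide([30, 0, 0], [0, 0, 30], lower=1.0, upper=10.0))
```

In this sketch, raising `upper` makes splits rarer (a stronger stemmer that keeps more words conflated), while lowering `lower` makes the rule more reluctant to confirm a merge; how the paper's two thresholds actually behave is not stated in the abstract.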