Distribution based stemmer refinement

Authors:
B. L. Narayan; Sankar K. Pal
Affiliations:
Machine Intelligence Unit, Indian Statistical Institute, Calcutta, India (both authors)
Venue:
PReMI'05 Proceedings of the First International Conference on Pattern Recognition and Machine Intelligence
Year:
2005

Abstract

Stemming is a common preprocessing step applied to text corpora. Errors made by a stemmer may be corrected either manually or automatically from a corpus. We describe a novel corpus-based stemming technique that models each word as being generated from a multinomial distribution over the topics available in the corpus. A procedure resembling sequential hypothesis testing groups together distributionally similar words. The method refines any given stemmer, and its strength can be controlled with two thresholds. A refinement based on the 20 Newsgroups data set shows that the proposed method splits equivalence classes appropriately.
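
The abstract does not give the details of the procedure, so the following is only a rough sketch of how such a refinement might look, not the authors' algorithm. It assumes that each word is summarised by its counts over topics, that two words are compared with a log-likelihood-ratio (G) statistic accumulated one topic at a time, and that the two thresholds act as a lower "merge" bound and an upper "split" bound. The names `sequential_test`, `refine_class`, `merge_threshold` and `split_threshold`, the rule for undecided words, and the toy counts are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sequential_test(a: Sequence[int], b: Sequence[int],
                    merge_threshold: float,
                    split_threshold: float) -> Tuple[str, float]:
    """Compare two topic-count vectors one topic at a time.

    Accumulates the G statistic for "both vectors come from the same
    multinomial". Each topic adds a non-negative amount, so the test can
    stop early once the running total reaches split_threshold.
    """
    total_a, total_b = sum(a), sum(b)
    n = total_a + total_b
    if total_a == 0 or total_b == 0:
        return "undecided", 0.0  # no evidence either way
    stat = 0.0
    for x, y in zip(a, b):
        column = x + y
        for observed, row_total in ((x, total_a), (y, total_b)):
            if observed > 0:
                expected = row_total * column / n
                stat += 2.0 * observed * math.log(observed / expected)
        if stat >= split_threshold:
            return "different", stat  # early stop: enough evidence to split
    if stat <= merge_threshold:
        return "same", stat
    return "undecided", stat


def refine_class(words: Sequence[str],
                 topic_counts: Dict[str, Sequence[int]],
                 merge_threshold: float = 2.0,
                 split_threshold: float = 10.0) -> List[List[str]]:
    """Greedily split one equivalence class produced by a base stemmer."""
    groups: List[List[str]] = []
    pooled: List[List[int]] = []  # summed topic counts per group
    for word in words:
        counts = topic_counts[word]
        best, best_stat = None, math.inf
        for i, group_counts in enumerate(pooled):
            decision, stat = sequential_test(
                counts, group_counts, merge_threshold, split_threshold)
            if decision == "same":
                best = i
                break
            # Assumption: an undecided word stays with its closest group.
            if decision == "undecided" and stat < best_stat:
                best, best_stat = i, stat
        if best is None:  # "different" from every existing group
            groups.append([word])
            pooled.append(list(counts))
        else:
            groups[best].append(word)
            pooled[best] = [p + c for p, c in zip(pooled[best], counts)]
    return groups


# Made-up counts over three topics for one class of a base stemmer.
toy_counts = {
    "general": [40, 35, 5],
    "generally": [38, 30, 6],
    "generator": [2, 3, 60],
    "generators": [1, 4, 50],
}
groups = refine_class(list(toy_counts), toy_counts)
print(groups)
```

With these toy counts the single class is split into two groups, one for "general"/"generally" and one for "generator"/"generators". In this sketch, raising `split_threshold` makes the refinement split less often, and raising `merge_threshold` makes it accept more words as distributionally similar.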