A Fast Corpus-Based Stemmer

Authors:
Jiaul H. Paik;Swapan K. Parui
Affiliations:
Indian Statistical Institute;Indian Statistical Institute
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2011

Abstract

Stemming is a form of word normalization that maps the variant forms of a word to their common root. In an Information Retrieval system, it is used to improve performance, specifically recall and, ideally, precision. Although its usefulness has been shown to be mixed for languages such as English, for morphologically complex languages stemming produces a significant performance improvement. Linguistic rule-based stemmers, which apply a set of rules to recover the root word from its variants, are available for most European languages. But for Indian languages, which are highly inflectional, devising a linguistic rule-based stemmer requires additional resources that are not available. We present an approach that is purely corpus based and finds the equivalence classes of variant words in an unsupervised manner. A set of experiments on four languages using FIRE, CLEF, and TREC test collections shows that our approach provides results comparable to those of linguistic rule-based stemmers for some languages and gives a significant performance improvement for resource-constrained languages such as Bengali and Marathi.
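
To make the idea of "equivalence classes of variant words" concrete, the following is a toy sketch in Python. It is not the algorithm of the paper, whose details the abstract does not give. It assumes, purely for illustration, that words sharing their first few characters belong to the same class; the function names `prefix_classes` and `stem_map` and the `min_prefix` threshold are hypothetical.

```python
from collections import defaultdict


def prefix_classes(vocabulary, min_prefix=4):
    """Group words that share their first `min_prefix` characters.

    Assumption for illustration only: a shared prefix stands in for
    morphological relatedness. Words shorter than `min_prefix` form
    a class keyed by the whole word.
    """
    classes = defaultdict(set)
    for word in vocabulary:
        key = word[:min_prefix] if len(word) >= min_prefix else word
        classes[key].add(word)
    return dict(classes)


def stem_map(classes):
    """Map every word to the shortest member of its class.

    Ties between equally short members are broken alphabetically.
    """
    mapping = {}
    for members in classes.values():
        representative = min(members, key=lambda w: (len(w), w))
        for word in members:
            mapping[word] = representative
    return mapping


# A small vocabulary, as might be collected from a corpus.
vocab = [
    "connect", "connected", "connecting", "connection",
    "retrieve", "retrieval", "retrieved", "stem",
]
classes = prefix_classes(vocab)
stems = stem_map(classes)
```

In a retrieval setting, both indexed terms and query terms would be looked up in a mapping like `stems`, so that variants in the same class match one another. A fixed-prefix rule like this one both over-merges and under-merges in practice, which is the kind of weakness that motivates statistically informed, corpus-based grouping.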