Effective and Robust Query-Based Stemming

Authors:
Jiaul H. Paik;Swapan K. Parui;Dipasree Pal;Stephen E. Robertson
Affiliations:
Indian Statistical Institute, Kolkata;Indian Statistical Institute, Kolkata;Indian Statistical Institute, Kolkata;Microsoft Research, Cambridge, UK
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2013

Abstract

Stemming is widely used in information retrieval systems to address the vocabulary mismatch problem that arises from morphological variation. The major shortcoming of commonly used stemmers is that they accept the morphological variants of query words without considering their thematic coherence with the given query, which hurts retrieval performance. Moreover, for many queries such stemmers perform worse than no stemming at all, which degrades robustness. The main goal of this article is to present fully automatic, corpus-based stemming algorithms that address these issues. Experiments on six TREC collections and three other non-English collections containing news and web documents show that the proposed query-based stemming algorithms consistently and significantly outperform four strong state-of-the-art stemmers built on entirely different principles. The experiments also confirm that the proposed query-based stemming algorithms are markedly more robust than these strong baselines.
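The notion of robustness used in the abstract (how many queries a stemmer hurts relative to no stemming) can be made concrete with a small sketch. The code below is illustrative only and is not taken from the article: the function name `robustness_summary`, the per-query scores, and the use of a robustness index of the form (improved - hurt) / total are all assumptions, since the abstract does not state which robustness measure the authors report.

```python
from typing import Dict


def robustness_summary(
    baseline: Dict[str, float],
    stemmed: Dict[str, float],
    eps: float = 1e-9,
) -> Dict[str, float]:
    """Compare per-query scores of a stemmed run against a no-stemming run.

    Both arguments map a query id to an effectiveness score such as
    average precision. Only queries present in both runs are compared.
    """
    shared = sorted(set(baseline) & set(stemmed))
    if not shared:
        raise ValueError("the two runs share no query ids")

    # A query counts as improved or hurt only if the difference exceeds eps.
    improved = sum(1 for q in shared if stemmed[q] - baseline[q] > eps)
    hurt = sum(1 for q in shared if baseline[q] - stemmed[q] > eps)
    total = len(shared)

    return {
        "queries": total,
        "improved": improved,
        "hurt": hurt,
        "unchanged": total - improved - hurt,
        # Assumed measure: ranges from -1 (all queries hurt) to +1 (all improved).
        "robustness_index": (improved - hurt) / total,
    }


# Hypothetical per-query average precision values, invented for illustration.
no_stemming = {"q1": 0.30, "q2": 0.45, "q3": 0.10, "q4": 0.25}
with_stemming = {"q1": 0.42, "q2": 0.40, "q3": 0.10, "q4": 0.31}

summary = robustness_summary(no_stemming, with_stemming)
print(summary)
```

A stemmer can raise mean effectiveness while still hurting many individual queries, so a per-query count of this kind complements the average scores.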