A novel corpus-based stemming algorithm using co-occurrence statistics

Authors:
Jiaul H. Paik;Dipasree Pal;Swapan K. Parui
Affiliations:
Indian Statistical Institute, Kolkata, India;Indian Statistical Institute, Kolkata, India;Indian Statistical Institute, Kolkata, India
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Citing 21
Cited 1

An evaluation of some conflation algorithms for information retrieval

Document retrieval systems
Viewing morphology as an inference process

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Corpus-based stemming using cooccurrence of word variants

ACM Transactions on Information Systems (TOIS)
An algorithm for suffix stripping

Readings in information retrieval
Data clustering: a review

ACM Computing Surveys (CSUR)
Improving the effectiveness of information retrieval with local context analysis

ACM Transactions on Information Systems (TOIS)
Probabilistic models of information retrieval based on measuring the divergence from randomness

ACM Transactions on Information Systems (TOIS)
CLEF Experiments at Maryland: Statistical Stemming and Backoff Translation

CLEF '00 Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation
Single n-gram stemming

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Character N-Gram Tokenization for European Language Text Retrieval

Information Retrieval
Unsupervised learning of the morphology of a natural language

Computational Linguistics
A probabilistic model for stemmer generation

Information Processing and Management: an International Journal - Special issue: An Asian digital libraries perspective
Light stemming approaches for the French, Portuguese, German and Hungarian languages

Proceedings of the 2006 ACM symposium on Applied computing
Context sensitive stemming for web search

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
YASS: Yet another suffix stripper

ACM Transactions on Information Systems (TOIS)
Searching strategies for the Hungarian language

Information Processing and Management: an International Journal
Introduction to Information Retrieval

Introduction to Information Retrieval
Bulgarian, Hungarian and Czech Stemming Using YASS

Advances in Multilingual and Multimodal Information Retrieval
Addressing morphological variation in alphabetic languages

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Indexing and stemming approaches for the Czech language

Information Processing and Management: an International Journal
Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

ACM Transactions on Asian Language Information Processing (TALIP)

Effective and Robust Query-Based Stemming

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a stemming algorithm for text retrieval. The algorithm uses the statistics collected on the basis of certain corpus analysis based on the co-occurrence between two word variants. We use a very simple co-occurrence measure that reflects how often a pair of word variants occurs in a document as well as in the whole corpus. A graph is formed where the word variants are the nodes and two word variants form an edge if they co-occur. On the basis of the co-occurrence measure, a certain edge strength is defined for each of the edges. Finally, on the basis of the edge strengths, we propose a partition algorithm that groups the word variants based on their strongest neighbors, that is, the neighbors with largest strengths. Our stemming algorithm has two static parameters and does not use any other information except the co-occurrence statistics from the corpus. The experiments on TREC, CLEF and FIRE data consisting of four European and two Asian languages show a significant improvement over no-stem strategy on all the languages. Also, the proposed algorithm significantly outperforms a number of strong stemmers including the rule-based ones on a number of languages. For highly inflectional languages, a relative improvement of about 50% is obtained compared to un-normalized words and a relative improvement ranging from 5% to 16% is obtained compared to the rule based stemmer for the concerned language.