YASS: Yet another suffix stripper

Authors:
Prasenjit Majumder;Mandar Mitra;Swapan K. Parui;Gobinda Kole;Pabitra Mitra;Kalyankumar Datta
Affiliations:
Indian Statistical Institute, Kolkata, India;Indian Statistical Institute, Kolkata, India;Indian Statistical Institute, Kolkata, India;Indian Statistical Institute, Kolkata, India;Indian Institute of Technology, Kharagpur, India;Jadavpur University, Calcutta, India
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2007

Abstract

Stemmers attempt to reduce a word to its stem or root form and are used widely in information retrieval tasks to increase the recall rate. Most popular stemmers encode a large number of language-specific rules built up over a long period of time. Such stemmers with comprehensive rules are available only for a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing tools have been successfully used to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures is defined, and the lexicon for a given text collection is clustered using the distance measures to identify these equivalence classes. The proposed approach is compared with Porter's and Lovins' stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.
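
The clustering idea in the abstract can be illustrated with a small sketch. The abstract does not give the distance measures or the clustering algorithm, so everything below is an assumption made for illustration: `prefix_distance` is a hypothetical measure that rewards a long shared prefix and penalizes an early mismatch, `complete_linkage` is a plain agglomerative procedure chosen for simplicity, and the threshold value `2.0` is arbitrary. The measures actually used are defined in the body of the article.

```python
from itertools import combinations


def prefix_distance(x: str, y: str) -> float:
    """Hypothetical string distance: small when x and y share a long prefix.

    Illustrative only; not taken from the abstract.
    """
    if x == y:
        return 0.0
    n = max(len(x), len(y))
    # Pad the shorter word so both strings have length n.
    x, y = x.ljust(n, "\0"), y.ljust(n, "\0")
    # Position of the first mismatching character.
    m = next(i for i in range(n) if x[i] != y[i])
    if m == 0:
        # No shared prefix at all: treat the words as unrelated.
        return float("inf")
    # Positions right after the mismatch weigh more than later ones.
    tail = sum(0.5 ** (i - m) for i in range(m, n))
    return (n - m) / m * tail


def complete_linkage(words, threshold):
    """Merge clusters greedily while the farthest pair stays within threshold."""
    clusters = [[w] for w in words]

    def linkage(a, b):
        # Complete linkage: distance between the two most distant members.
        return max(prefix_distance(u, v) for u in a for v in b)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


if __name__ == "__main__":
    lexicon = ["stem", "stems", "stemming", "stemmer",
               "cluster", "clusters", "clustering"]
    for group in complete_linkage(lexicon, threshold=2.0):
        print(sorted(group))
```

Each resulting cluster plays the role of one equivalence class: at indexing time every member would be mapped to a single representative, such as the shortest word in the cluster. The threshold controls how aggressive the stemmer is, and in practice it would need to be tuned on the collection at hand.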