The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages rank among the world’s 20 most spoken languages, and they share similar syntax, morphology, and writing systems. We examine them from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and on that basis we propose both a light and a more aggressive stemming approach. We evaluate these stemming strategies on the FIRE 2008 test collections and, to broaden the comparison, we also implement and evaluate two language-independent indexing methods: n-gram and trunc-n (truncation of each word to its first n letters). We test these solutions with various IR models, including Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf idf and Lnu-ltc. Experiments performed with all three languages demonstrate that the I(ne)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our tests suggest that more aggressive stemmers, especially those accounting for certain derivational suffixes, yield better retrieval effectiveness than either a light stemmer or no stemming at all. Comparisons between no-stemming and stemming indexing schemes show that the performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements over a no-stemming approach are ~28% for Hindi, ~42% for Marathi, and ~18% for Bengali.
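To make the reported measures concrete, here is a minimal sketch of how mean average precision and a relative improvement over a baseline are commonly computed. The function names and the toy rankings are illustrative assumptions and are not taken from the article; the article's own evaluation setup may differ in detail.

```python
from typing import Iterable, List, Set, Tuple


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for one query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[List[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change of `improved` with respect to `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# Toy data (hypothetical document ids, not from the FIRE collections).
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6", "d7"], {"d6"}),
]
toy_map = mean_average_precision(toy_runs)
```

Under this definition, a hypothetical MAP of 0.25 without stemming and 0.32 with stemming would correspond to a relative improvement of about 28%.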
Based on a comparison of word-based and language-independent approaches, we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than those of the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.
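The two language-independent schemes can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, n-grams are generated within word boundaries, words no longer than n are kept whole, and a "letter" is treated as one Unicode code point. For Devanagari or Bengali script, the article's notion of a letter may differ from a code point, so this is not a reproduction of its implementation.

```python
from typing import List


def trunc_n(word: str, n: int = 4) -> str:
    """Keep only the first n characters of a word (trunc-n)."""
    return word[:n]


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Overlapping character n-grams of a single word.

    Assumption: words with n or fewer characters are indexed unchanged.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, scheme: str = "trunc", n: int = 4) -> List[str]:
    """Turn whitespace-separated words into indexing terms."""
    terms: List[str] = []
    for word in text.split():
        if scheme == "trunc":
            terms.append(trunc_n(word, n))
        elif scheme == "ngram":
            terms.extend(char_ngrams(word, n))
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return terms


print(index_terms("retrieval systems", scheme="trunc"))
print(index_terms("retrieval", scheme="ngram"))
```

With trunc-4 each word yields exactly one indexing term, whereas the 4-gram scheme yields several overlapping terms per word, which enlarges the index and lets fragments from the middle or end of unrelated words match one another.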