Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

Authors:
Ljiljana Dolamic;Jacques Savoy
Affiliations:
University of Neuchatel;University of Neuchatel
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2010

Citing 24
Cited 4

Viewing morphology as an inference process

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Statistical inference in retrieval effectiveness evaluation

Information Processing and Management: an International Journal
Corpus-based stemming using cooccurrence of word variants

ACM Transactions on Information Systems (TOIS)
Experimentation as a way of life: Okapi at TREC

Information Processing and Management: an International Journal - The sixth text REtrieval conference (TREC-6)
A stop list for general text

ACM SIGIR Forum
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Probabilistic models of information retrieval based on measuring the divergence from randomness

ACM Transactions on Information Systems (TOIS)
From Plain Character Strings to Meaningful Words: Producing Better Full Text Databases for Inflectional and Compounding Languages with Morphological Analysis Software

Information Retrieval
Cross-Language Evaluation Forum: Objectives, Results, Achievements

Information Retrieval
Monolingual Document Retrieval for European Languages

Information Retrieval
Character N-Gram Tokenization for European Language Text Retrieval

Information Retrieval
How Effective is Stemming and Decompounding for German Text Retrieval?

Information Retrieval
A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems (TOIS)
Stemming and lemmatization in the clustering of finnish text documents

Proceedings of the thirteenth ACM international conference on Information and knowledge management
The SMART Retrieval System—Experiments in Automatic Document Processing

The SMART Retrieval System—Experiments in Automatic Document Processing
Light stemming approaches for the French, Portuguese, German and Hungarian languages

Proceedings of the 2006 ACM symposium on Applied computing
YASS: Yet another suffix stripper

ACM Transactions on Information Systems (TOIS)
Introduction to Information Retrieval

Introduction to Information Retrieval
Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers

Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers
Addressing morphological variation in alphabetic languages

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Algorithmic stemmers or morphological analysis? An evaluation

Journal of the American Society for Information Science and Technology
Indexing and stemming approaches for the Czech language

Information Processing and Management: an International Journal
Is a morphologically complex language really that complex in full-text retrieval?

FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Statistical and comparative evaluation of various indexing and search models

AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology

A novel corpus-based stemming algorithm using co-occurrence statistics

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Comparative information retrieval evaluation for scanned documents

Proceedings of the 15th WSEAS international conference on Computers
GRAS: An effective and efficient stemming algorithm for information retrieval

ACM Transactions on Information Systems (TOIS)
Effective and Robust Query-Based Stemming

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world’s 20 most spoken languages and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective through describing the key elements of their inflectional and derivational morphologies, and suggest a light and more aggressive stemming approach based on them. In our evaluation of these stemming strategies we make use of the FIRE 2008 test collections, and then to broaden our comparisons we implement and evaluate two language independent indexing methods: the n-gram and trunc-n (truncation of the first n letters). We evaluate these solutions by applying our various IR models, including the Okapi, Divergence from Randomness (DFR) and statistical language models (LM) together with two classical vector-space approaches: tf idf and Lnu-ltc. Experiments performed with all three languages demonstrate that the I(ne)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that improved retrieval effectiveness would be obtained by applying more aggressive stemmers, especially those accounting for certain derivational suffixes, compared to those involving a light stemmer or ignoring this type of word normalization procedure. Comparisons between no stemming and stemming indexing schemes shows that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach. Based on a comparison of word-based and language-independent approaches we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.