Sub-Word Indexing and Blind Relevance Feedback for English, Bengali, Hindi, and Marathi IR

Authors:
Johannes Leveling;Gareth J. F. Jones
Affiliations:
Dublin City University;Dublin City University
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2010

Abstract

The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. This article explores several research questions: 1) how to create a simple, language-independent corpus-based stemmer; 2) how to identify sub-words and which types of sub-words are suitable as indexing units; and 3) how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are as follows. The corpus-based stemming approach is effective as a knowledge-light term conflation step and is useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and, when combined with query expansion, significantly better than the baseline of indexing words. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance than word indexing. No single method performs best for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest mean average precision (MAP). However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form, which increases the number of index terms but decreases their average length.
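As an illustration of two of the sub-word types named above, the following minimal sketch produces overlapping character n-grams and k-prefixes from whitespace-separated tokens. It is not the authors' implementation. The function names (`char_ngrams`, `k_prefix`, `index_terms`) are hypothetical, keeping words shorter than n as whole terms is an assumption, and the code slices Python strings by code point, which may not match how the article segments Bengali, Hindi, or Marathi text.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Assumption: a word no longer than n is kept whole as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_prefix(word: str, k: int = 4) -> str:
    """Return the first k characters of a word (the whole word if shorter)."""
    return word[:k]


def index_terms(text: str, unit: str = "word", size: int = 3) -> list[str]:
    """Turn text into index terms for one indexing unit.

    unit is "word", "ngram", or "prefix"; tokenization by lowercasing and
    whitespace splitting is a simplification for illustration only.
    """
    tokens = text.lower().split()
    if unit == "ngram":
        return [gram for tok in tokens for gram in char_ngrams(tok, size)]
    if unit == "prefix":
        return [k_prefix(tok, size) for tok in tokens]
    return tokens
```

Calling `index_terms("retrieval", unit="ngram", size=3)` yields seven index terms of length three for one word form, which shows how n-gram indexing raises the number of index terms while shortening them.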
The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
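Blind relevance feedback with a configurable number of feedback documents and terms can be sketched as follows. This is a simplified illustration, not the article's method: the term score used here (feedback-document count times a smoothed inverse document frequency) is an assumption standing in for whatever selection formula the authors used, and `select_feedback_terms` and `expand_query` are hypothetical names. The defaults of 10 documents and 20 terms mirror the setting mentioned in the abstract. The `top_terms` parameter is where a larger or vocabulary-dependent value would be supplied for sub-word indexes.

```python
import math
from collections import Counter


def select_feedback_terms(ranked_docs, doc_freq, num_docs_total,
                          top_docs=10, top_terms=20):
    """Pick expansion terms from the top-ranked documents of a first pass.

    ranked_docs: list of documents (each a list of index terms), best first.
    doc_freq: mapping from term to its document frequency in the collection.
    The scoring formula below is an illustrative assumption.
    """
    # Count in how many feedback documents each term occurs.
    in_feedback = Counter()
    for doc in ranked_docs[:top_docs]:
        in_feedback.update(set(doc))

    def score(term):
        idf = math.log((num_docs_total + 1) / (doc_freq.get(term, 0) + 1))
        return in_feedback[term] * idf

    # Sort by descending score; break ties alphabetically for determinism.
    ranked_terms = sorted(in_feedback, key=lambda t: (-score(t), t))
    return ranked_terms[:top_terms]


def expand_query(query_terms, feedback_terms):
    """Append feedback terms that are not already in the query."""
    seen = set(query_terms)
    return list(query_terms) + [t for t in feedback_terms if t not in seen]
```

Because the documents are plain lists of index terms, the same functions apply unchanged whether the terms are word forms, stems, n-grams, or prefixes.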