Parameter tuning in pivoted normalization for XML retrieval: ISI@INEX09 adhoc focused task

  • Authors:
  • Sukomal Pal; Mandar Mitra; Debasis Ganguly

  • Affiliations:
  • Information Retrieval Lab, CVPR Unit, Indian Statistical Institute, Kolkata, India; Information Retrieval Lab, CVPR Unit, Indian Statistical Institute, Kolkata, India; Synopsys, Bangalore, India

  • Venue:
  • INEX'09: Proceedings of Focused Retrieval and Evaluation, the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval
  • Year:
  • 2009


Abstract

This paper describes the work that we did at Indian Statistical Institute towards XML retrieval for INEX 2009. Since there was a quantum jump in the INEX corpus size (from 4.6 GB with 659,388 articles to 50.7 GB with 2,666,190 articles), retrieval algorithms and systems were put to a 'stress test' in the INEX 2009 campaign. We tuned our text retrieval system (SMART), based on the Vector Space Model (VSM), which we have been using since INEX 2006. We submitted two runs for the adhoc focused task. Both runs used VSM-based document-level retrieval with blind feedback: an initial run (indsta_VSMpart) used only a small fraction of the INEX 2009 corpus; the other (indsta_VSMfb) used the full corpus. We considered Content-Only (CO) retrieval, using the Title and Description fields of the INEX 2009 adhoc queries (2009001-2009115). Our official runs, however, used incorrect topic numbers, which led to very poor performance. Post-submission, the corrected versions of both the baseline and the with-feedback document-level runs achieved competitive scores. We performed a set of experiments to tune our pivoted normalization-based term-weighting scheme for XML retrieval. The scores of our best document-level runs, both with and without blind feedback, improved substantially after tuning of the normalization parameters. We also ran element-level retrieval on a subset of the document-level runs; the new parameter settings yielded competitive results in this case as well. On the evaluation front, we observed an anomaly in the implementation of the evaluation scripts when interpolated precision is calculated. We raise the issue because an XML retrievable unit (passage/element) can be partially relevant, containing a portion of non-relevant text, unlike the document retrieval paradigm, where a document is considered either completely relevant or completely non-relevant.
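The pivoted normalization tuned in this work follows the well-known SMART-style term-weighting scheme of Singhal et al., in which document length normalization is "pivoted" around the average document length by a slope parameter. The sketch below is an illustrative reconstruction of that standard formula, not the authors' exact SMART configuration; the function name and defaults are assumptions for illustration.

```python
import math

def pivoted_weight(tf, dl, avdl, df, N, slope=0.2):
    """Illustrative pivoted-normalization term weight (Singhal et al. style).

    tf    -- raw term frequency in the document
    dl    -- document length (e.g. number of tokens)
    avdl  -- average document length in the collection
    df    -- document frequency of the term
    N     -- number of documents in the collection
    slope -- the pivoted-normalization tuning parameter
    """
    if tf == 0:
        return 0.0
    # Dampened term-frequency component (double log, as in SMART's 'L').
    tf_part = 1.0 + math.log(1.0 + math.log(tf))
    # Pivoted length normalization: slope=0 disables length normalization.
    norm = (1.0 - slope) + slope * (dl / avdl)
    # Inverse document frequency component.
    idf = math.log((N + 1) / df)
    return (tf_part / norm) * idf
```

With slope set to 0 the length normalizer collapses to 1, so longer documents are not penalized at all; tuning the slope trades off this penalty, which is what the parameter-tuning experiments above vary for the much larger INEX 2009 corpus.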