Exploring criteria for successful query expansion in the genomic domain

  • Authors:
  • Nicola Stokes, Yi Li, Lawrence Cavedon, Justin Zobel

  • Affiliations:
  • NICTA Victoria Research Lab, Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia (all authors)

  • Venue:
  • Information Retrieval
  • Year:
  • 2009

Abstract

Query expansion is commonly used in Information Retrieval to overcome vocabulary mismatch issues, such as synonymy between the original query terms and the terms of a relevant document. In general, query expansion experiments exhibit mixed results. Overall TREC Genomics Track results are also mixed; however, results from the top performing systems provide strong evidence supporting the need for expansion. In this paper, we examine the conditions necessary for optimal query expansion performance with respect to two system design issues: the IR framework and the knowledge source used for expansion. We present a query expansion framework that improves Okapi baseline passage MAP performance by 185%. Using this framework, we compare and contrast the effectiveness of a variety of biomedical knowledge sources used by TREC 2006 Genomics Track participants for expansion. Based on the outcome of these experiments, we discuss the success factors required for effective query expansion with respect to various sources of expansion terms, such as corpus-based cooccurrence statistics, pseudo-relevance feedback methods, and domain-specific and domain-independent ontologies and databases. Our results show that the choice of document ranking algorithm is the most important factor affecting retrieval performance on this dataset. In addition, when an appropriate ranking algorithm is used, we find that query expansion with domain-specific knowledge sources provides an equally substantive gain in performance over the baseline system.
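The abstract does not describe the authors' implementation. Purely as a generic illustration of the pseudo-relevance feedback idea it mentions (not the paper's actual expansion framework or ranking function), the following sketch expands a query with the highest-scoring non-query terms drawn from a few top-ranked documents; the term-scoring heuristic and the example documents are assumptions for illustration only.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase and split on alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query, top_docs, num_terms=5):
    """Append the highest-scoring candidate terms from the
    pseudo-relevant (top-ranked) documents to the original query."""
    query_terms = set(tokenize(query))
    doc_tokens = [tokenize(d) for d in top_docs]
    # Term frequency across the pseudo-relevant set.
    tf = Counter(t for tokens in doc_tokens for t in tokens)
    # Document frequency within the set: favour terms that occur
    # in several of the top-ranked documents, not just one.
    df = Counter(t for tokens in doc_tokens for t in set(tokens))
    candidates = {
        t: tf[t] * math.log(1 + df[t])
        for t in tf
        if t not in query_terms and len(t) > 2
    }
    expansion = [t for t, _ in sorted(candidates.items(),
                                      key=lambda kv: -kv[1])][:num_terms]
    return query + " " + " ".join(expansion)


if __name__ == "__main__":
    # Hypothetical top-ranked passages for the query below.
    docs = [
        "BRCA1 mutations are associated with hereditary breast cancer risk",
        "Mutation screening of the BRCA1 gene in breast and ovarian cancer",
        "Tumour suppressor function of BRCA1 in DNA repair pathways",
    ]
    print(expand_query("BRCA1 breast cancer", docs))
```

In a full system, the expanded query would be re-submitted to the ranking model (e.g. an Okapi BM25 retriever), which is where, per the abstract, the choice of ranking algorithm matters most.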