Top-K data source selection for keyword queries over multiple XML data sources

  • Authors: Khanh Nguyen; Jinli Cao
  • Affiliations: La Trobe University, Australia (both authors)
  • Venue: Journal of Information Science
  • Year: 2012

Abstract

With the proliferation of XML data, keyword search over XML has attracted much attention. However, most current approaches focus on keyword search over a single XML document. In a system that integrates hundreds or even thousands of data sources, querying every source sequentially is extremely costly and may be impractical. In this article we propose a novel approach that selects the top-K data sources according to their relevance to a given query, avoiding the high cost of searching numerous, potentially irrelevant sources. Our approach summarizes each data source as a succinct synopsis so that unpromising sources can be filtered out quickly. We maintain both structural and value-distribution information for each data source, and propose a novel ranking function that effectively measures the relevance of a data source to the given query. We conducted experiments on real datasets, and the results show that our approach achieves high performance on all evaluation metrics (recall, precision and Spearman's rank correlation coefficient) under different experimental parameters.
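
The general idea of synopsis-based source selection and its evaluation can be illustrated with a minimal Python sketch. It is not the ranking function or synopsis structure proposed in the article, neither of which is specified in this abstract. The synopsis here is assumed to be a plain keyword-count map, the score is a placeholder (sum of relative keyword frequencies, zero if any keyword is absent), and all names (`build_synopsis`, `score`, `select_top_k`, `precision_recall`, `spearman_rho`) are hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical synopsis: keyword -> occurrence count within one data source.
Synopsis = Dict[str, int]


def build_synopsis(terms: Sequence[str]) -> Synopsis:
    """Summarize a source as lower-cased term counts (a stand-in synopsis)."""
    return dict(Counter(t.lower() for t in terms))


def score(synopsis: Synopsis, query: Sequence[str]) -> float:
    """Placeholder relevance score, NOT the article's ranking function.

    Returns 0 if any query keyword is missing from the source, otherwise
    the sum of the keywords' relative frequencies.
    """
    total = sum(synopsis.values())
    counts = [synopsis.get(k.lower(), 0) for k in query]
    if total == 0 or not counts or min(counts) == 0:
        return 0.0
    return sum(c / total for c in counts)


def select_top_k(
    synopses: Dict[str, Synopsis], query: Sequence[str], k: int
) -> List[Tuple[str, float]]:
    """Drop sources with score 0, then return the k best-scoring sources."""
    scored = ((name, score(s, query)) for name, s in synopses.items())
    promising = [(name, sc) for name, sc in scored if sc > 0]
    return heapq.nlargest(k, promising, key=lambda pair: pair[1])


def precision_recall(
    selected: Sequence[str], relevant: Sequence[str]
) -> Tuple[float, float]:
    """Precision and recall of the selected sources against a reference set."""
    hits = len(set(selected) & set(relevant))
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def spearman_rho(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Spearman's rank correlation for two tie-free rankings of the same items."""
    n = len(ranking_a)
    if n < 2 or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same two or more items")
    position_b = {item: i for i, item in enumerate(ranking_b)}
    d_squared = sum((i - position_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

In this sketch, only the synopses are consulted at selection time, so the full sources never need to be queried to discard unpromising ones. A real system would replace `score` with a function that also uses structural information, as the abstract describes.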