Scalability of Source Identification in Data Integration Systems

  • Authors:
  • François Boisson;Michel Scholl;Imen Sebei;Dan Vodislav

  • Affiliations:
  • CNAM/CEDRIC, Paris, France;CNAM/CEDRIC, Paris, France;CNAM/CEDRIC, Paris, France;CNAM/CEDRIC, Paris, France

  • Venue:
  • Advanced Internet Based Systems and Applications
  • Year:
  • 2009

Abstract

Given a large number of data sources, each indexed by attributes from a predefined set $\mathcal{A}$, and a query q over a subset Q of $\mathcal{A}$ containing k attributes, we are interested in identifying all combinations of sources whose attributes together cover Q. Each combination c may lead to a rewriting of q as a join over the sources in c. Furthermore, to limit redundancy and combinatorial explosion, we require each combination of sources to be a minimal cover of Q, that is, a cover from which no source can be removed without losing coverage. Although this work is motivated by query rewriting in OpenXView [3], an XML data integration system with a large number of XML sources, we believe that the solutions provided in this paper apply to other scalable data integration schemes. We focus on the case where the number of sources is very large while the size of queries is small. We propose a novel algorithm for computing the set of minimal covers of a query and experimentally evaluate its performance.
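
To make the notion of a minimal cover concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm proposed in the paper, which the abstract does not describe; it only illustrates the problem statement. The function name `minimal_covers`, the dictionary representation of sources, and the example data are all assumptions made for illustration.

```python
from itertools import combinations


def minimal_covers(sources, query):
    """Enumerate every minimal cover of `query` by exhaustive search.

    `sources` maps a (hypothetical) source name to its set of attributes.
    Illustration only: NOT the paper's algorithm, and it does not scale
    to a very large number of sources.
    """
    query = frozenset(query)
    # Restrict each source to the query attributes and drop sources
    # that contribute none of them.
    relevant = {}
    for name, attrs in sources.items():
        useful = frozenset(attrs) & query
        if useful:
            relevant[name] = useful
    names = sorted(relevant)

    def covers(combo):
        # True when the union of the sources' attributes contains Q.
        return frozenset().union(*(relevant[n] for n in combo)) == query

    result = []
    # In a minimal cover every source supplies at least one attribute
    # that no other source in the cover supplies, so a minimal cover
    # has at most k = |Q| sources.
    for size in range(1, min(len(query), len(names)) + 1):
        for combo in combinations(names, size):
            if not covers(combo):
                continue
            # Minimal: removing any single source must break coverage.
            if all(not covers(combo[:i] + combo[i + 1:])
                   for i in range(size)):
                result.append(combo)
    return result


# Hypothetical example: five sources, query over Q = {a, b, c}.
example_sources = {
    "s1": {"a", "b"},
    "s2": {"b", "c"},
    "s3": {"c"},
    "s4": {"a", "b", "c"},
    "s5": {"d"},
}
print(minimal_covers(example_sources, {"a", "b", "c"}))
```

In this example, {s1, s2, s3} covers Q but is not minimal, because s2 can be removed without losing coverage. The bound on cover size explains why small queries keep each cover small even when the number of sources is large, although the exhaustive enumeration above still examines a number of candidate combinations that grows polynomially in the number of sources with degree k.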