Data integration with dependent sources

  • Authors:
  • Anish Das Sarma;Xin Luna Dong;Alon Halevy

  • Affiliations:
  • Yahoo Research;AT&T Labs--Research;Google Inc.

  • Venue:
  • Proceedings of the 14th International Conference on Extending Database Technology
  • Year:
  • 2011

Abstract

Data integration systems offer users a uniform interface to a set of data sources. Previous work has typically assumed that the data sources are independent of each other; however, in scenarios involving large numbers of sources, such as the Web or large enterprises, there is an ecosystem of dependent sources, where some sources copy parts of their data from others. This paper considers the new optimization problems that arise when answering queries over a large number of dependent sources. These are (1) the cost-minimization problem: what is the minimum cost we must incur to get all answer tuples; (2) the maximum-coverage problem: given a bound on the cost, how can we get the maximum possible coverage; and (3) the source-ordering problem: for a set of data sources, what is the best order to query them so as to retrieve answer tuples as fast as possible. We consider these optimization problems under several cost models and show that, in general, they are intractable. We describe effective approximation algorithms that enable us to solve these problems in practice. We then identify the causes of the high complexity and show that, for restricted classes, the optimization problems can be solved in polynomial time.
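To make the source-ordering and maximum-coverage problems concrete, the following is a minimal sketch of a generic greedy heuristic, not the algorithms described in the paper. It assumes a toy model that the abstract does not specify: each source has a fixed, positive access cost and a known set of answer tuples, and a source that copies from another simply repeats some of its tuples. The names `greedy_order`, `coverage_within_budget`, and the example `sources` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed toy model: source name -> (positive access cost, set of answer tuples).
Source = Tuple[float, Set[str]]


def greedy_order(sources: Dict[str, Source]) -> List[str]:
    """Order sources by unseen tuples gained per unit cost (generic greedy heuristic)."""
    remaining = dict(sources)
    seen: Set[str] = set()
    order: List[str] = []
    while remaining:
        # Iterating over sorted names makes ties break alphabetically.
        best = max(
            sorted(remaining),
            key=lambda name: len(remaining[name][1] - seen) / remaining[name][0],
        )
        order.append(best)
        seen |= remaining.pop(best)[1]
    return order


def coverage_within_budget(
    sources: Dict[str, Source], budget: float
) -> Tuple[List[str], Set[str]]:
    """Walk the greedy order, querying a source only if it fits the budget and adds tuples."""
    chosen: List[str] = []
    seen: Set[str] = set()
    spent = 0.0
    for name in greedy_order(sources):
        cost, tuples = sources[name]
        if spent + cost <= budget and tuples - seen:
            chosen.append(name)
            spent += cost
            seen |= tuples
    return chosen, seen


# Hypothetical example: B copies part of A, so querying B after A adds nothing.
sources: Dict[str, Source] = {
    "A": (1.0, {"t1", "t2", "t3"}),
    "B": (1.0, {"t1", "t2"}),
    "C": (2.0, {"t4", "t5"}),
}

order = greedy_order(sources)
chosen, covered = coverage_within_budget(sources, budget=3.0)
```

In this example the copier B is pushed to the end of the order and skipped under the budget, which illustrates why dependence between sources changes which sources are worth querying. A greedy rule of this kind carries no optimality guarantee in this sketch; the paper's own cost models and approximation algorithms may differ.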