Data integration with dependent sources

Authors:
Anish Das Sarma;Xin Luna Dong;Alon Halevy
Affiliations:
Yahoo Research;AT&T Labs--Research;Google Inc.
Venue:
Proceedings of the 14th International Conference on Extending Database Technology
Year:
2011

Abstract

Data integration systems offer users a uniform interface to a set of data sources. Previous work has typically assumed that the data sources are independent of each other; however, in scenarios involving large numbers of sources, such as the Web or large enterprises, there is an ecosystem of dependent sources, where some sources copy parts of their data from others. This paper considers the new optimization problems that arise when answering queries over a large number of dependent sources. These are (1) the cost-minimization problem: what is the minimum cost we must incur to get all answer tuples; (2) the maximum-coverage problem: given a bound on the cost, how can we get the maximum possible coverage; and (3) the source-ordering problem: for a set of data sources, what is the best order in which to query them so as to retrieve answer tuples as fast as possible. We consider these optimization problems under several cost models and show that, in general, they are intractable. We describe effective approximation algorithms that enable us to solve these problems in practice. We then identify the causes of the high complexity and show that, for restricted classes, the optimization problems can be solved in polynomial time.
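
To make the maximum-coverage problem concrete, the following is a minimal sketch of a generic greedy heuristic: repeatedly pick the affordable source that adds the most new answer tuples per unit cost. It is not the algorithm from the paper, and no approximation guarantee is claimed for it. The function name `greedy_budgeted_coverage`, the toy sources, and the cost model (one fixed, positive cost per source) are all assumptions made for illustration; the abstract only says that several cost models are considered.

```python
from typing import Dict, List, Set, Tuple


def greedy_budgeted_coverage(
    sources: Dict[str, Set[str]],
    costs: Dict[str, float],
    budget: float,
) -> Tuple[List[str], Set[str]]:
    """Greedily choose sources by new tuples gained per unit cost.

    Assumes every cost is strictly positive (hypothetical cost model).
    """
    chosen: List[str] = []
    covered: Set[str] = set()
    remaining = budget
    candidates = set(sources)

    while candidates:
        best, best_ratio = None, 0.0
        # Sorted iteration keeps tie-breaking deterministic.
        for name in sorted(candidates):
            if costs[name] > remaining:
                continue  # cannot afford this source
            gain = len(sources[name] - covered)  # tuples not yet retrieved
            ratio = gain / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds a new tuple
        chosen.append(best)
        covered |= sources[best]
        remaining -= costs[best]
        candidates.remove(best)

    return chosen, covered


# Toy example: source B copies part of A, so querying both is partly redundant.
demo_sources = {
    "A": {"t1", "t2", "t3", "t4"},
    "B": {"t1", "t2", "t3"},
    "C": {"t5", "t6"},
}
demo_costs = {"A": 4.0, "B": 1.0, "C": 2.0}
demo_chosen, demo_covered = greedy_budgeted_coverage(demo_sources, demo_costs, budget=3.0)
```

In this toy instance, the cheap copier B stands in for most of A, which is the kind of dependence between sources that the abstract describes. Because a source's gain is computed against the tuples already covered, a source whose data has already been retrieved through a copy contributes nothing further.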