Data integration systems offer users a uniform interface to a set of data sources. Previous work has typically assumed that the data sources are independent of each other; however, in scenarios involving large numbers of sources, such as the Web or large enterprises, there is an ecosystem of dependent sources, where some sources copy parts of their data from others. This paper considers the new optimization problems that arise when answering queries over a large number of dependent sources. These are (1) the cost-minimization problem: what is the minimum cost we must incur to get all answer tuples; (2) the maximum-coverage problem: given a bound on the cost, how can we get the maximum possible coverage; and (3) the source-ordering problem: for a set of data sources, what is the best order in which to query them so as to retrieve answer tuples as quickly as possible. We consider these optimization problems under several cost models and show that, in general, they are intractable. We describe effective approximation algorithms that enable us to solve these problems in practice. We then identify the causes of the high complexity and show that, for restricted classes, the optimization problems can be solved in polynomial time.
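To make the maximum-coverage and source-ordering problems concrete, the following is a minimal sketch of a generic greedy heuristic. It is not the algorithm developed in the paper. It assumes that each source's answer tuples are known in advance as a set, that each source has a fixed positive query cost, and that the total cost is the sum of the costs of the queried sources. The function name `greedy_coverage` and the example sources are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_coverage(
    sources: Dict[str, Set[int]],
    costs: Dict[str, float],
    budget: float,
) -> Tuple[List[str], Set[int]]:
    """Greedily pick sources by new tuples gained per unit cost.

    Illustrative assumptions: tuple sets are known up front, costs are
    positive, and total cost is the sum of the chosen sources' costs.
    """
    order: List[str] = []
    covered: Set[int] = set()
    spent = 0.0
    remaining = set(sources)
    while remaining:
        # Candidates must fit in the remaining budget and add at least one new tuple.
        candidates = [
            s for s in sorted(remaining)
            if spent + costs[s] <= budget and sources[s] - covered
        ]
        if not candidates:
            break
        # Highest marginal gain per unit cost; ties go to the first name alphabetically.
        best = max(candidates, key=lambda s: len(sources[s] - covered) / costs[s])
        order.append(best)
        covered |= sources[best]
        spent += costs[best]
        remaining.remove(best)
    return order, covered


if __name__ == "__main__":
    # Hypothetical sources: B copies a subset of A, C is independent.
    sources = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {5, 6}}
    costs = {"A": 1.0, "B": 1.0, "C": 1.0}
    print(greedy_coverage(sources, costs, budget=2.0))
```

In this example the copier `B` is never queried, because after `A` has been read it contributes no new tuples. The returned order also serves as one possible answer to the source-ordering problem under the same assumptions. A greedy choice of this kind is a heuristic and can be suboptimal when costs differ across sources.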