Transactions on large-scale data- and knowledge-centered systems III
Applications that integrate data from multiple sources often require a preliminary data reconciliation step to ensure that tuples match correctly across the sources. In dynamic settings such as data mashups, however, traditional offline reconciliation techniques, which require the data to be available in advance, may not be applicable. The alternative, performing similarity joins at query time, is computationally expensive, while ignoring the mismatch problem altogether leads to an incomplete integration. In this paper we assume that, in some dynamic integration scenarios, users may agree to trade some completeness of a join result for a faster computation. We explore the consequences of this assumption by proposing a novel hybrid join algorithm that combines exact and approximate join operators, managed using adaptive query processing techniques. The algorithm is optimistic: it can switch between physical join operators multiple times during query processing, but it resorts to approximate join operators only when there is statistical evidence that result completeness is compromised. Our experiments show that appreciable savings in join execution time can be achieved in practice, at the expense of a modest reduction in result completeness.
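The switching behaviour described above can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the abstract does not specify the statistical test, so a sliding-window match rate compared against an assumed `expected_rate` stands in for it, and `difflib.SequenceMatcher` stands in for the approximate join operator. All names (`hybrid_join`, `exact_probe`, `approx_probe`, `window`, `min_sim`) are hypothetical.

```python
from collections import deque
from difflib import SequenceMatcher


def exact_probe(key, right_index):
    """Exact join operator: constant-time lookup on the join key."""
    return [(key, key)] if key in right_index else []


def approx_probe(key, right_list, min_sim):
    """Approximate join operator: linear scan with a similarity threshold."""
    return [(key, r) for r in right_list
            if SequenceMatcher(None, key, r).ratio() >= min_sim]


def hybrid_join(left_keys, right_keys, expected_rate=0.75, window=4, min_sim=0.8):
    """Optimistic hybrid join: start exact, switch operators as evidence changes.

    Assumption: a drop of the recent exact-match rate below `expected_rate`
    is treated as evidence that completeness is compromised. This simple
    threshold is a placeholder for a proper statistical test.
    """
    right_list = list(right_keys)
    right_index = set(right_list)
    recent = deque(maxlen=window)  # 1 if the tuple had an exact partner, else 0
    mode = "exact"
    results, trace = [], []
    for key in left_keys:
        trace.append(mode)  # record which operator handled this tuple
        if mode == "exact":
            matches = exact_probe(key, right_index)
        else:
            matches = approx_probe(key, right_list, min_sim)
        results.extend(matches)
        recent.append(1 if key in right_index else 0)
        if len(recent) == window:
            rate = sum(recent) / window
            new_mode = "exact" if rate >= expected_rate else "approx"
            if new_mode != mode:
                mode = new_mode
                recent.clear()  # gather fresh evidence under the new operator
    return results, trace


right = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]
left = ["alice", "bob", "carol", "dave", "alise", "bobb", "carol", "davee",
        "erinn", "frank", "grace", "heidi", "alice", "bob"]
results, trace = hybrid_join(left, right)
```

In this run the misspelled keys `alise` and `bobb` arrive while the exact operator is still active and are lost, which is the completeness cost of being optimistic. Once the window shows too many misses, the approximate operator takes over and recovers `davee` and `erinn`, and the join returns to the cheaper exact operator after the clean keys resume.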