Time-completeness trade-offs in record linkage using adaptive query processing

Authors:
Roald Lengu;Paolo Missier;Alvaro A. A. Fernandes;Giovanna Guerrini;Marco Mesiti
Affiliations:
Università di Genova, Italy;University of Manchester, UK;University of Manchester, UK;Università di Genova, Italy;Università di Milano, Italy
Venue:
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Year:
2009

Abstract

Applications that integrate data from multiple sources often require a preliminary data reconciliation step to ensure that tuples match correctly across the sources. In dynamic settings such as data mashups, however, traditional offline reconciliation techniques, which require the data to be available in advance, may not be applicable. The alternative, performing similarity joins at query time, is computationally expensive, while ignoring the mismatch problem altogether leads to an incomplete integration. In this paper we assume that, in some dynamic integration scenarios, users may agree to trade some completeness of a join result for a faster computation. We explore the consequences of this assumption by proposing a novel hybrid join algorithm that combines exact and approximate join operators, managed using adaptive query processing techniques. The algorithm is optimistic: it can switch between physical join operators multiple times during query processing, but it resorts to approximate join operators only when there is statistical evidence that result completeness is compromised. Our experiments show that appreciable savings in join execution time can be achieved in practice, at the expense of a modest reduction in result completeness.
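The switching idea described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the function names (`hybrid_join`, `similar`), the windowed match-rate check, and the parameters `expected_rate`, `window`, and `threshold` are all assumptions made for illustration, and a plain threshold on the observed exact-match rate stands in for whatever statistical test the paper actually uses. String similarity comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """Return True when two keys are similar enough to count as a match."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def hybrid_join(left, right, expected_rate=0.8, window=4, threshold=0.85):
    """Join two lists of (key, value) pairs, switching operators adaptively.

    Hypothetical policy: the left input is consumed in windows. Each window
    uses the cheap exact operator unless the exact match rate observed in the
    previous window fell below expected_rate; in that case the approximate
    operator also handles keys that the exact lookup misses.
    """
    # Hash index over the right input, used by the exact operator.
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)

    results, modes = [], []
    approximate = False  # optimistic start: assume the keys are clean
    for start in range(0, len(left), window):
        batch = left[start:start + window]
        modes.append("approx" if approximate else "exact")
        exact_hits = 0
        for key, value in batch:
            if key in index:
                exact_hits += 1
                results.extend((value, rv) for rv in index[key])
            elif approximate:
                # Expensive fallback: scan every right key for a near match.
                for rkey, rvalues in index.items():
                    if similar(key, rkey, threshold):
                        results.extend((value, rv) for rv in rvalues)
        # Re-evaluate the evidence; this can switch the operator either way.
        approximate = exact_hits / len(batch) < expected_rate
    return results, modes


if __name__ == "__main__":
    right = [("genova", 1), ("manchester", 2), ("milano", 3), ("london", 4)]
    left = [
        ("genova", "a"), ("manchester", "b"), ("milano", "c"), ("london", "d"),
        ("genoa", "e"), ("manchestr", "f"), ("milan", "g"), ("londn", "h"),
        ("genva", "i"), ("manchester", "j"), ("milano", "k"), ("london", "l"),
    ]
    results, modes = hybrid_join(left, right)
    print(modes)
    print(results)
```

In this toy run the second window is still processed with the exact operator, so its misspelled keys are lost; only afterwards does the low match rate trigger the switch to approximate matching. That lost window is the kind of completeness given up in exchange for avoiding similarity comparisons while the data looks clean.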