Reasoning About Approximate Match Query Results

Authors:
Sudipto Guha;Nick Koudas;Divesh Srivastava;Xiaohui Yu
Affiliations:
U of Pennsylvania;U of Toronto;AT&T Labs Research;U of Toronto
Venue:
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Year:
2006

Citing 0
Cited 6

A strategy for allowing meaningful and comparable scores in approximate matching

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Hashed samples: selectivity estimators for set similarity selection queries

Proceedings of the VLDB Endowment
Automatic threshold estimation for data matching applications

SBBD '08 Proceedings of the 23rd Brazilian symposium on Databases
A strategy for allowing meaningful and comparable scores in approximate matching

Information Systems
Automatic threshold estimation for data matching applications

Information Sciences: an International Journal

Abstract

Join techniques deploying approximate match predicates are fundamental data cleaning operations. A variety of predicates have been utilized to quantify approximate match in such operations, and some have been embedded in a declarative data cleaning framework. These techniques return pairs of tuples, one from each relation, tagged with a score signifying the degree of similarity between the tuples in the pair according to the specific approximate match predicate. In this paper, we consider the problem of estimating various parameters on the output of declarative approximate join algorithms for planning purposes. Such algorithms are highly time-consuming, so precise knowledge of the result size as well as its score distribution is a pressing concern. This knowledge aids decisions as to which operations are more promising for identifying highly similar tuples, which is a key operation for data cleaning. We propose solution strategies that fully comply with a declarative framework and analytically reason about the quality of the estimates we obtain as well as the performance of our strategies. We present the results of a detailed performance evaluation of all strategies proposed. Our experimental results validate our analytical expectations and shed additional light on the quality and performance of our estimation framework. Our study offers a set of simple, fully declarative techniques for this problem, which can be readily deployed in data cleaning systems.
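
The abstract does not spell out the estimation strategies themselves, so the following is only an illustrative sketch of the general problem it describes: estimating the result size and score distribution of an approximate join without running the join in full. It assumes uniform random sampling of one relation and Jaccard similarity over whitespace tokens as the match predicate; both choices, along with the names `jaccard` and `estimate_join`, are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def estimate_join(r, s, threshold, sample_size, buckets=5, seed=0):
    """Estimate the size and score histogram of an approximate join of r and s.

    Hypothetical estimator: score a uniform sample of r against all of s,
    then scale the observed counts by len(r) / sample size.
    Returns (estimated_result_size, {bucket_index: estimated_count}).
    """
    rng = random.Random(seed)
    n = min(sample_size, len(r))
    if n == 0:
        return 0.0, {}
    sample = rng.sample(r, n)
    scale = len(r) / n

    hist = Counter()
    for x in sample:
        for y in s:
            score = jaccard(x, y)
            if score >= threshold:
                # Bucket scores into equal-width bins over [0, 1];
                # a score of exactly 1.0 falls into the last bin.
                hist[min(int(score * buckets), buckets - 1)] += 1

    est_hist = {b: c * scale for b, c in sorted(hist.items())}
    return sum(est_hist.values()), est_hist


r = ["at&t labs research", "univ of toronto", "u of pennsylvania"]
s = ["at&t labs", "university of toronto", "u of pennsylvania"]
size, hist = estimate_join(r, s, threshold=0.5, sample_size=len(r))
```

With `sample_size` equal to `len(r)` the scale factor is 1 and the estimate coincides with the exact join; smaller samples trade accuracy for time, which is the kind of trade-off the abstract says the paper analyzes.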