Automatic threshold estimation for data matching applications

Authors:
Juliana Bonato dos Santos;Carlos A. Heuser;Viviane Moreira Orengo;Leandro Krug Wives
Affiliations:
Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre -- RS -- Brazil (all four authors)
Venue:
SBBD '08: Proceedings of the 23rd Brazilian Symposium on Databases
Year:
2008

Abstract

Several advanced data management applications, such as data integration, data deduplication, and similarity querying, rely on similarity functions. A similarity function requires a threshold value to decide whether two data instances match, i.e., whether they represent the same real-world object. Choosing this threshold is therefore a central problem. In this paper, we propose a method for estimating the quality of a similarity function. Quality is measured in terms of recall and precision, calculated at several different thresholds. Based on the results of this estimation process and the requirements of a specific application, a user can choose an adequate threshold value. The estimation process relies on a clustering phase performed on a sample taken from the data collection and requires no human intervention.
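
To make the notion of recall and precision at several thresholds concrete, the following Python sketch tabulates both measures for a toy set of scored pairs and then picks a threshold from an application requirement. It is an illustration only and not the method of the paper: it assumes the true match labels are known, whereas the proposed process estimates quality from a clustering phase without human input. The function names, the toy scores and labels, and the minimum-precision selection rule are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def precision_recall_at(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when pairs with score >= threshold are declared matches.

    `labels` holds the true match status of each pair; this is an assumption
    of the sketch (the paper estimates quality without such labels).
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Convention chosen here: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def choose_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    thresholds: Sequence[float],
    min_precision: float,
) -> Optional[float]:
    """Hypothetical selection rule: the lowest candidate threshold whose
    precision meets `min_precision`. Recall never increases as the threshold
    rises, so the lowest qualifying threshold keeps recall as high as possible.
    """
    for t in sorted(thresholds):
        precision, _ = precision_recall_at(scores, labels, t)
        if precision >= min_precision:
            return t
    return None


# Toy data: similarity scores of candidate pairs and their true match status.
scores: List[float] = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30]
labels: List[bool] = [True, True, True, False, True, False, False]
candidates: List[float] = [0.50, 0.65, 0.75, 0.85]

for t in candidates:
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")

print("chosen for precision >= 0.9:", choose_threshold(scores, labels, candidates, 0.9))
```

On this toy data, an application that demands precision of at least 0.9 would settle on 0.75, while one that tolerates precision of 0.8 could use 0.50 and retain every true match.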