Automatic threshold estimation for data matching applications

  • Authors:
  • Juliana Bonato dos Santos; Carlos A. Heuser; Viviane Moreira Orengo; Leandro Krug Wives

  • Affiliations:
  • Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre -- RS -- Brazil; Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre -- RS -- Brazil; Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre -- RS -- Brazil; Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre -- RS -- Brazil

  • Venue:
  • SBBD '08 Proceedings of the 23rd Brazilian symposium on Databases
  • Year:
  • 2008

Abstract

Several advanced data management applications, such as data integration, data deduplication, and similarity querying, rely on similarity functions. A similarity function requires a threshold value to decide whether two different data instances match, i.e., whether they represent the same real-world object. Choosing this threshold is therefore a central problem. In this paper, we propose a method for estimating the quality of a similarity function. Quality is measured in terms of recall and precision calculated at several different thresholds. Based on the results of the proposed estimation process, and taking into account the requirements of a specific application, a user can choose a threshold value that is adequate for the application. The proposed estimation process is based on a clustering phase performed on a sample taken from a data collection and requires no human intervention.
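
To make the quality measure concrete, the following is a minimal sketch of how precision and recall might be tabulated at several thresholds. It is not the authors' estimation procedure: the abstract does not detail the clustering phase, so the sketch assumes that a set of scored pairs with known match labels is already available (in the paper, such information would come from the clustering step rather than from a human). The function names, the sample data, and the convention of returning 1.0 when a denominator is zero are all illustrative assumptions.

```python
from typing import List, Tuple

# Each pair is (similarity score, whether the pair is a true match).
# The labels are assumed to be given; they are hypothetical sample data.
ScoredPair = Tuple[float, bool]


def precision_recall_at(pairs: List[ScoredPair], threshold: float) -> Tuple[float, float]:
    """Precision and recall when every pair with score >= threshold is declared a match."""
    tp = sum(1 for score, is_match in pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in pairs if score < threshold and is_match)
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def threshold_table(pairs: List[ScoredPair], thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """One (threshold, precision, recall) row per candidate threshold."""
    return [(t, *precision_recall_at(pairs, t)) for t in thresholds]


pairs = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.75, True), (0.40, False), (0.30, False),
]

for t, p, r in threshold_table(pairs, [0.5, 0.7, 0.85]):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

With a table like this, a user whose application favors precision could pick a higher threshold, while one that favors recall could pick a lower one, which is the kind of trade-off the abstract describes.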