Automatic threshold estimation for data matching applications

Authors:
Juliana B. dos Santos; Carlos A. Heuser; Viviane P. Moreira; Leandro K. Wives
Affiliations:
Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Caixa Postal 15.064, 91.501-970 Porto Alegre, RS, Brazil (all authors)
Venue:
Information Sciences: an International Journal
Year:
2011

Abstract

Several advanced data management applications, such as data integration, data deduplication, and similarity querying, rely on similarity functions. A similarity function requires a threshold value to decide whether two data instances match, i.e., whether they represent the same real-world object. Defining this threshold is therefore a central problem. This paper proposes a method for estimating the quality of a similarity function, where quality is measured in terms of recall and precision calculated at several different thresholds. Based on the results of the estimation process and the requirements of a specific application, a user can choose a suitable threshold value. The estimation process is based on a clustering phase performed over a data collection (or a sample thereof). It requires no human intervention, since the choice of similarity threshold is based on the silhouette coefficient, an internal quality measure for clusters. An extensive set of experiments on artificial and real datasets demonstrates the effectiveness of the proposed approach: in most cases, the estimation error was below 10% in terms of precision and recall.
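
The abstract does not spell out the clustering algorithm or how the silhouette coefficient drives the estimation, so the following Python sketch is only one possible reading, not the authors' procedure. It assumes a simple threshold-linkage clustering (connected components of pairs whose similarity reaches a threshold), selects the clustering with the highest mean silhouette as a stand-in for ground truth, and then computes pairwise precision and recall at a candidate threshold against that reference. All function names (`cluster_by_threshold`, `mean_silhouette`, `pick_reference_clustering`, `estimate_precision_recall`) and the conventions for undefined cases are hypothetical.

```python
from itertools import combinations


def cluster_by_threshold(items, sim, threshold):
    """Connected components of the graph linking pairs with sim >= threshold.

    Assumption: this linkage rule is a placeholder for whatever clustering
    the paper actually uses.
    """
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]


def mean_silhouette(items, sim, labels):
    """Mean silhouette coefficient, with distance defined as 1 - sim."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        return 0.0  # convention chosen here: undefined for a single cluster

    def mean_dist(i, members):
        return sum(1.0 - sim(items[i], items[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def pick_reference_clustering(items, sim, thresholds):
    """Return (threshold, labels) for the clustering with the best silhouette."""
    def score(t):
        return mean_silhouette(items, sim, cluster_by_threshold(items, sim, t))

    best = max(thresholds, key=score)
    return best, cluster_by_threshold(items, sim, best)


def estimate_precision_recall(items, sim, threshold, reference_labels):
    """Pairwise precision/recall at `threshold`, treating reference_labels as truth."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(items)), 2):
        predicted = sim(items[i], items[j]) >= threshold
        actual = reference_labels[i] == reference_labels[j]
        tp += predicted and actual
        fp += predicted and not actual
        fn += actual and not predicted
    # Convention chosen here: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Calling `pick_reference_clustering` once and then `estimate_precision_recall` for each candidate threshold would give a precision/recall table of the kind a user could consult when choosing a threshold. The pairwise loops are quadratic in the number of items, which is one reason a sample of the collection might be used instead of the full data.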