A Bayesian decision model for cost optimal record matching

Authors:
V. S. Verykios; G. V. Moustakides; M. G. Elfeky
Affiliations:
College of Information Science and Technology, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104-2875, USA; Computer Engineering and Informatics, University of Patras, Greece; Department of Computer Sciences, Purdue University, West Lafayette, IN 47907-1398, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2003

Abstract

In an error-free system with perfectly clean data, constructing a global view of the data consists of linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, most of the time these data are neither carefully controlled for quality nor defined consistently across different data sources. As a result, the creation of such a global data view resorts to approximate joins. In this paper, an optimal solution is proposed for matching or linking database record pairs in the presence of inconsistencies, errors, or missing values in the data. Existing models for record matching rely on decision rules that minimize the probability of error, that is, the probability that a sample (a measurement vector) is assigned to the wrong class. In practice, though, minimizing the probability of error is not the best criterion for designing a decision rule, because misclassifications of different samples may have different consequences. In this paper we present a decision model that minimizes the cost of making a decision. In particular: (a) we present a decision rule; (b) we prove that this rule is optimal with respect to the cost of a decision; and (c) we compute the probabilities of the two types of errors (Type I and Type II) that are incurred when this rule is applied. We also present a closed-form decision model for a certain class of record comparison pairs, along with an example and results from comparing the proposed cost-based model to the error-based model for large record comparison spaces.
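
The contrast the abstract draws between an error-based rule and a cost-based rule can be illustrated with a small sketch. This is not the paper's model or its closed-form solution; it is a generic two-class Bayes decision under stated assumptions. The class labels `M` (matched) and `U` (unmatched), the function names, the likelihoods, the prior, and the cost table are all hypothetical values chosen for illustration.

```python
from typing import Dict


def posterior_match(likelihood_m: float, likelihood_u: float, prior_m: float) -> float:
    """Return P(M | x) by Bayes' rule for two classes: M (matched), U (unmatched)."""
    numerator = likelihood_m * prior_m
    evidence = numerator + likelihood_u * (1.0 - prior_m)
    return numerator / evidence


def min_error_decision(p_match: float) -> str:
    """Error-based rule: pick the class with the larger posterior probability."""
    return "match" if p_match >= 0.5 else "non-match"


def min_cost_decision(p_match: float, cost: Dict[str, Dict[str, float]]) -> str:
    """Cost-based rule: pick the decision with the smallest expected cost.

    cost[decision][true_class] is the (assumed) cost of taking `decision`
    when the record pair truly belongs to `true_class`.
    """
    expected = {
        decision: c["M"] * p_match + c["U"] * (1.0 - p_match)
        for decision, c in cost.items()
    }
    return min(expected, key=expected.get)


# Hypothetical costs: a false match is assumed ten times worse than a missed match.
COST = {
    "match": {"M": 0.0, "U": 10.0},
    "non-match": {"M": 1.0, "U": 0.0},
}

# Hypothetical comparison vector x with P(x | M) = 0.6, P(x | U) = 0.1, P(M) = 0.2.
p = posterior_match(likelihood_m=0.6, likelihood_u=0.1, prior_m=0.2)

error_based = min_error_decision(p)       # decides on the posterior alone
cost_based = min_cost_decision(p, COST)   # weighs the posterior by the costs
```

With these assumed numbers the posterior probability of a match is about 0.6, so the error-based rule declares a match, while the cost-based rule declares a non-match because the expected cost of a false match (10 × 0.4 = 4.0) exceeds that of a missed match (1 × 0.6 = 0.6). The two rules coincide only when both kinds of misclassification are assumed equally costly.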