In an error-free system with perfectly clean data, constructing a global view of the data amounts to linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, most of the time these data are neither carefully controlled for quality nor defined consistently across data sources. As a result, creating such a global view must resort to approximate joins. In this paper, an optimal solution is proposed for matching, or linking, database record pairs in the presence of inconsistencies, errors, or missing values in the data. Existing models for record matching rely on decision rules that minimize the probability of error, that is, the probability that a sample (a measurement vector) is assigned to the wrong class. In practice, though, minimizing the probability of error is not the best criterion for designing a decision rule, because misclassifications of different samples may have different consequences. In this paper we present a decision model that minimizes the cost of making a decision. In particular: (a) we present a decision rule; (b) we prove that this rule is optimal with respect to the cost of a decision; and (c) we compute the probabilities of the two types of errors (Type I and Type II) that are incurred when this rule is applied. We also present a closed-form decision model for a certain class of record comparison pairs, along with an example, and results from comparing the proposed cost-based model to the error-based model for large record comparison spaces.
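To make the contrast between the error-based and cost-based criteria concrete, the following is a minimal sketch of a generic minimum-expected-cost (Bayes) decision for a single record comparison vector. It is not the decision rule derived in the paper. The function names, the two-decision setting, the likelihoods, the prior, and the cost values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical cost table: decision -> (cost if the pair truly matches,
#                                       cost if the pair truly does not match)
CostTable = Dict[str, Tuple[float, float]]


def min_error_decision(lik_m: float, lik_u: float, prior_m: float) -> str:
    """Error-based rule: pick the class with the larger (unnormalized) posterior."""
    weight_m = lik_m * prior_m          # P(x | M) * P(M)
    weight_u = lik_u * (1.0 - prior_m)  # P(x | U) * P(U)
    return "match" if weight_m >= weight_u else "non-match"


def min_cost_decision(lik_m: float, lik_u: float, prior_m: float,
                      costs: CostTable) -> str:
    """Cost-based rule: pick the decision with the smallest expected cost."""
    weight_m = lik_m * prior_m
    weight_u = lik_u * (1.0 - prior_m)
    expected = {
        decision: cost_if_m * weight_m + cost_if_u * weight_u
        for decision, (cost_if_m, cost_if_u) in costs.items()
    }
    return min(expected, key=expected.get)


# Assumed numbers for one comparison vector x (illustrative only).
lik_m, lik_u, prior_m = 0.3, 0.1, 0.2

# Assumption: missing a true match is five times as costly as a false match.
asymmetric_costs: CostTable = {"match": (0.0, 1.0), "non-match": (5.0, 0.0)}
# With 0/1 costs the cost-based rule reduces to the error-based rule.
zero_one_costs: CostTable = {"match": (0.0, 1.0), "non-match": (1.0, 0.0)}

error_based = min_error_decision(lik_m, lik_u, prior_m)
cost_based = min_cost_decision(lik_m, lik_u, prior_m, asymmetric_costs)
cost_based_01 = min_cost_decision(lik_m, lik_u, prior_m, zero_one_costs)
print(error_based, cost_based, cost_based_01)
```

With these assumed numbers the error-based rule declares a non-match, while the cost-based rule with asymmetric costs declares a match, because a missed match is weighted more heavily. Under 0/1 costs the two rules agree, which is the sense in which minimizing the probability of error is a special case of minimizing cost.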