Bagging, bumping, multiview, and active learning for record linkage with empirical results on patient identity data

Authors:
M. Sariyar;A. Borg
Affiliations:
Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Centre of the Johannes Gutenberg University Mainz, Germany;Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Centre of the Johannes Gutenberg University Mainz, Germany
Venue:
Computer Methods and Programs in Biomedicine
Year:
2012



Abstract

Record linkage, or deduplication, deals with the detection and deletion of duplicates within and across files. For this task, this paper introduces and evaluates two new machine-learning methods (bumping and multiview) together with bagging, a tree-based ensemble approach. Whereas bumping is also a tree-based approach, multiview combines different methods with the semi-supervised learning principle. After providing the theoretical background of the methods, we present initial empirical results on patient identity data. In the empirical evaluation, we calibrate the methods on three different kinds of training data. The results show that the smallest training data set, which is obtained by a simple active learning strategy, leads to the best results. Multiview outperforms the other methods only when all are calibrated on a randomly sampled training set; in all other cases, it performs worse. The results of bumping do not differ significantly from those of bagging, the best-performing method overall. We cautiously conclude that tree-based record linkage methods are likely to produce similar results because of the low dimensionality (p ≪ n) and straightforwardness of the underlying problem. Multiview may be better suited to more complex problems.
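
To make the difference between the two tree-based methods concrete, the following is a minimal sketch and not the authors' implementation. It assumes each record pair has already been reduced to a vector of per-field similarities in [0, 1], uses one-split decision stumps in place of full trees, and runs on purely synthetic data; all function names and parameters are hypothetical. Bagging takes a majority vote over models fit on bootstrap samples, whereas bumping keeps only the single bootstrap model that fits the original training data best (by convention, the model fit on the original sample is also a candidate).

```python
import random

def fit_stump(X, y):
    """Fit a one-split tree: predict 'match' when row[feature] >= threshold."""
    best = None  # (training errors, feature index, threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            errors = sum((row[j] >= t) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]

def predict(stump, row):
    feature, threshold = stump
    return row[feature] >= threshold

def train_errors(stump, X, y):
    """Number of misclassified pairs on the given labelled data."""
    return sum(predict(stump, row) != label for row, label in zip(X, y))

def bootstrap_stumps(X, y, n_models, rng):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    n = len(X)
    stumps = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def bagging_predict(stumps, row):
    """Bagging: majority vote over all bootstrap models."""
    votes = sum(predict(s, row) for s in stumps)
    return 2 * votes > len(stumps)

def bumping_select(candidates, X, y):
    """Bumping: keep the one candidate with the fewest errors on the original data."""
    return min(candidates, key=lambda s: train_errors(s, X, y))

# Synthetic comparison vectors (assumption): three similarity scores per record pair.
rng = random.Random(0)
matches = [[rng.uniform(0.5, 1.0) for _ in range(3)] for _ in range(30)]
non_matches = [[rng.uniform(0.0, 0.6) for _ in range(3)] for _ in range(30)]
X = matches + non_matches
y = [True] * 30 + [False] * 30

stumps = bootstrap_stumps(X, y, n_models=25, rng=rng)
candidates = stumps + [fit_stump(X, y)]  # include the fit on the original sample
bumped = bumping_select(candidates, X, y)

new_pair = [0.9, 0.8, 0.95]
print("bagging says match:", bagging_predict(stumps, new_pair))
print("bumping says match:", predict(bumped, new_pair))
```

Because bumping returns a single model, its result stays as interpretable as one tree, while bagging trades that interpretability for the variance reduction of averaging.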