Record linkage, or deduplication, is the detection and removal of duplicate records within and across files. For this task, this paper introduces and evaluates two new machine-learning methods, bumping and multiview, alongside bagging, a tree-based ensemble approach. Bumping is likewise tree-based, whereas multiview combines different methods under the semi-supervised learning principle. After providing the theoretical background of the methods, we report initial empirical results on patient identity data. In the empirical evaluation, we calibrate the methods on three different kinds of training data. The results show that the smallest training data set, which is obtained by a simple active learning strategy, leads to the best results. Multiview can outperform the other methods only when all are calibrated on a randomly sampled training set; in all other cases, it performs worse. The results of bumping do not differ significantly from those of bagging, the best-performing method overall. We cautiously conclude that tree-based record linkage methods are likely to produce similar results because of the low dimensionality (p ≪ n) and straightforwardness of the underlying problem. Multiview may be better suited to more complex problems.
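
To make the difference between the two tree-based methods concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes the standard definitions: bagging fits one model per bootstrap sample and classifies by majority vote, while bumping fits the same bootstrap models but keeps only the single model with the lowest error on the original training data. One-split decision stumps stand in for full trees, and the comparison vectors (a name similarity and a birthdate similarity per record pair), the labels, and all function names are hypothetical illustrations.

```python
import random

def fit_stump(X, y):
    """Fit a one-split stump: predict 1 (match) when X[feature] >= threshold."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            errors = sum((row[f] >= t) != bool(label) for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]

def predict(stump, row):
    feature, threshold = stump
    return int(row[feature] >= threshold)

def error_rate(stump, X, y):
    return sum(predict(stump, r) != label for r, label in zip(X, y)) / len(y)

def bootstrap_stumps(X, y, n_rounds, rng):
    """Fit one stump on the original sample plus one per bootstrap resample."""
    stumps = [fit_stump(X, y)]
    n = len(X)
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def bumping_select(stumps, X, y):
    """Bumping: keep the single candidate with the lowest error on the original data."""
    return min(stumps, key=lambda s: error_rate(s, X, y))

def bagging_predict(stumps, row):
    """Bagging: majority vote over all bootstrap models."""
    votes = sum(predict(s, row) for s in stumps)
    return int(2 * votes > len(stumps))

# Hypothetical comparison vectors: [name similarity, birthdate similarity].
X = [[0.95, 1.0], [0.90, 0.8], [0.85, 1.0], [0.30, 0.0],
     [0.20, 0.2], [0.40, 0.0], [0.88, 0.9], [0.10, 0.1]]
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = match (duplicate), 0 = non-match

rng = random.Random(0)
stumps = bootstrap_stumps(X, y, n_rounds=20, rng=rng)  # 21 candidates in total
bumped = bumping_select(stumps, X, y)

new_pair = [0.92, 1.0]
print("bumping:", predict(bumped, new_pair))
print("bagging:", bagging_predict(stumps, new_pair))
```

On such a small, low-dimensional and cleanly separable problem, both procedures tend to settle on nearly the same split, which is consistent with the cautious conclusion above that tree-based methods are likely to produce similar results here.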