Automatic training example selection for scalable unsupervised record linkage

Authors:
Peter Christen
Affiliations:
Department of Computer Science, The Australian National University, Canberra, ACT, Australia
Venue:
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2008

Abstract

Linking records from two or more databases is an increasingly important data preparation step in many data mining projects, as linked data can enable studies that are not feasible otherwise, or that would require expensive collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. Many modern classification techniques are based on supervised machine learning and thus require training data, which is often not available in real-world situations. This paper presents a novel two-step approach to unsupervised record pair classification. In the first step, training examples are selected automatically; in the second step, they are used to train a binary classifier. An experimental evaluation shows that this approach can outperform k-means clustering and can also be much faster than other classification techniques.
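
The two-step idea can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not say how training examples are selected or which binary classifier is trained. The sketch assumes that each record pair has already been reduced to a vector of field similarities in [0, 1]. It takes vectors close to all ones as match seeds and vectors close to all zeros as non-match seeds, and it uses a simple nearest-centroid classifier as a stand-in for the second step. The names `select_seeds`, `train`, `classify`, and the `margin` parameter are hypothetical.

```python
from statistics import fmean


def select_seeds(vectors, margin=0.2):
    """Step 1 (assumed rule): keep only the clearest comparison vectors.

    A vector is a match seed if every similarity is within `margin` of 1.0,
    and a non-match seed if every similarity is within `margin` of 0.0.
    """
    matches = [v for v in vectors if min(v) >= 1.0 - margin]
    non_matches = [v for v in vectors if max(v) <= margin]
    return matches, non_matches


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def train(matches, non_matches):
    """Step 2 (stand-in classifier): one centroid per class."""
    if not matches or not non_matches:
        raise ValueError("need at least one seed example per class")
    return centroid(matches), centroid(non_matches)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(model, vector):
    """Return True (match) if the vector is closer to the match centroid."""
    match_centroid, non_match_centroid = model
    return (squared_distance(vector, match_centroid)
            < squared_distance(vector, non_match_centroid))


# Toy comparison vectors (made-up similarities for three fields).
vectors = [
    [1.0, 0.9, 1.0],    # clear match, becomes a seed
    [0.9, 1.0, 0.85],   # clear match, becomes a seed
    [0.0, 0.1, 0.0],    # clear non-match, becomes a seed
    [0.1, 0.0, 0.15],   # clear non-match, becomes a seed
    [0.8, 0.6, 0.9],    # ambiguous, labelled by the trained classifier
    [0.2, 0.4, 0.1],    # ambiguous, labelled by the trained classifier
]

matches, non_matches = select_seeds(vectors)
model = train(matches, non_matches)
labels = [classify(model, v) for v in vectors]
print(labels)
```

Only the seed vectors are used for training, so no manually labelled data is needed; the ambiguous pairs are then labelled by the trained model. Any other binary classifier could take the place of the nearest-centroid rule in the second step.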