A two-step classification approach to unsupervised record linkage

Authors:
Peter Christen
Affiliations:
The Australian National University, Canberra ACT, Australia
Venue:
AusDM '07 Proceedings of the sixth Australasian conference on Data mining and analytics - Volume 70
Year:
2007

Citing 20
Cited 9

Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
Semi-supervised Clustering by Seeding

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Partially Supervised Classification of Text Documents

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
PEBL: positive example based learning for Web page classification using SVM

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
TAILOR: A Record Linkage Tool Box

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Building Text Classifiers Using Positive and Unlabeled Examples

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Methods for evaluating and creating data quality

Information Systems - Special issue: Data quality in cooperative information systems
Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data (TKDD)
A Comparison of Personal Name Matching: Techniques and Practical Issues

ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops
Towards automated record linkage

AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analystics - Volume 61
LIBSVM: A library for support vector machines

ACM Transactions on Intelligent Systems and Technology (TIST)
Probabilistic data generation for deduplication and data linkage

IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning
Decision models for record linkage

Data Mining
Automatically detecting criminal identity deception: an adaptive detection algorithm

IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans

Febrl: a freely available record linkage system with a graphical user interface

HDKM '08 Proceedings of the second Australasian workshop on Health data and knowledge management - Volume 80
Automatic record linkage using seeded nearest neighbour and support vector machine classification

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
The Normalized Compression Distance as a Distance Measure in Entity Identification

ICDM '09 Proceedings of the 9th Industrial Conference on Advances in Data Mining. Applications and Theoretical Aspects
Development and user experiences of an open source data cleaning, deduplication and record linkage system

ACM SIGKDD Explorations Newsletter
Automatic training example selection for scalable unsupervised record linkage

PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Entity Resolution and Information Quality

Entity Resolution and Information Quality
Ontology and instance matching

Knowledge-driven multimedia information extraction and ontology evolution
Improving classifications for cardiac autonomic neuropathy using multi-level ensemble classifiers and feature selection based on random forest

AusDM '12 Proceedings of the Tenth Australasian Data Mining Conference - Volume 134

Quantified Score

Hi-index	0.00

Visualization

Abstract

Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect manually. A main challenge when linking large databases is the classification of the compared record pairs into matches and non-matches. In traditional record linkage, classification thresholds have to be set either manually or using an EM-based approach. More recently developed classification methods are mainly based on supervised machine learning techniques and thus require training data, which is often not available in real world situations or has to be prepared manually. In this paper, a novel two-step approach to record pair classification is presented. In a first step, example training data of high quality is generated automatically, and then used in a second step to train a supervised classifier. Initial experimental results on both real and synthetic data show that this approach can outperform traditional unsupervised clustering, and even achieve linkage quality almost as good as fully supervised techniques.