Automatic record linkage using seeded nearest neighbour and support vector machine classification

Authors:
Peter Christen
Affiliations:
The Australian National University, Canberra, Australia
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Citing 15
Cited 17

Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
TAILOR: A Record Linkage Tool Box

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Text classification from positive and unlabeled documents

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Methods for evaluating and creating data quality

Information Systems - Special issue: Data quality in cooperative information systems
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Towards automated record linkage

AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61
A two-step classification approach to unsupervised record linkage

AusDM '07 Proceedings of the sixth Australasian conference on Data mining and analytics - Volume 70
Febrl: a freely available record linkage system with a graphical user interface

HDKM '08 Proceedings of the second Australasian workshop on Health data and knowledge management - Volume 80
Automatic training example selection for scalable unsupervised record linkage

PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Probabilistic data generation for deduplication and data linkage

IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning
Decision models for record linkage

Data Mining

Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Geocode Matching and Privacy Preservation

Privacy, Security, and Trust in KDD
Generic Entity Resolution in Relational Databases

ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
Privacy and anonymization for very large datasets

Proceedings of the 18th ACM conference on Information and knowledge management
Development and user experiences of an open source data cleaning, deduplication and record linkage system

ACM SIGKDD Explorations Newsletter
A community question-answering refinement system

Proceedings of the 22nd ACM conference on Hypertext and hypermedia
Ontology and instance matching

Knowledge-driven multimedia information extraction and ontology evolution
Fusion of similarity measures for time series classification

HAIS'11 Proceedings of the 6th international conference on Hybrid artificial intelligent systems - Volume Part II
Personalized book recommendations created by using social media data

WISS'10 Proceedings of the 2010 international conference on Web information systems engineering
Computer-based genealogy reconstruction in founder populations

Journal of Biomedical Informatics
Multiple valued logic approach for matching patient records in multiple databases

Journal of Biomedical Informatics
Flexible and efficient distributed resolution of large entities

FoIKS'12 Proceedings of the 7th international conference on Foundations of Information and Knowledge Systems
Automatic SLA Matching and Provider Selection in Grid and Cloud Computing Markets

GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
A group recommender for movies based on content similarity and popularity

Information Processing and Management: an International Journal
A taxonomy of privacy-preserving record linkage techniques

Information Systems
Automated discovery of multi-faceted ontologies for accurate query answering and future semantic reasoning

Data & Knowledge Engineering
Tracking people over time in 19th century Canada for longitudinal analysis

Machine Learning

Abstract

The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While classification was traditionally based on manually set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples to the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
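
The two-step idea described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes each compared record pair is represented by a vector of per-field similarities in [0, 1], it picks seeds by Euclidean distance to the all-ones and all-zeros vectors, and it labels every pair by its single nearest seed. The function names `select_seeds` and `classify_nearest_seed`, the `threshold` value, and the example vectors are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def select_seeds(vectors, threshold=0.2):
    """Step 1 (assumed rule): take vectors very close to 'all fields agree'
    as match seeds and vectors very close to 'all fields disagree' as
    non-match seeds."""
    dim = len(vectors[0])
    all_ones = (1.0,) * dim
    all_zeros = (0.0,) * dim
    match_seeds = [v for v in vectors if dist(v, all_ones) <= threshold]
    non_match_seeds = [v for v in vectors if dist(v, all_zeros) <= threshold]
    return match_seeds, non_match_seeds


def classify_nearest_seed(vectors, match_seeds, non_match_seeds):
    """Step 2 (simplified): label each vector by whichever seed set
    contains its nearest seed."""
    if not match_seeds or not non_match_seeds:
        raise ValueError("both seed sets must be non-empty")
    labels = []
    for v in vectors:
        d_match = min(dist(v, s) for s in match_seeds)
        d_non_match = min(dist(v, s) for s in non_match_seeds)
        labels.append("match" if d_match <= d_non_match else "non-match")
    return labels


# Hypothetical similarity vectors for six compared record pairs
# (for example name, address and date-of-birth similarity).
pairs = [
    (1.0, 1.0, 1.0),
    (0.9, 1.0, 0.95),
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.05),
    (0.7, 0.8, 0.6),
    (0.3, 0.2, 0.4),
]

match_seeds, non_match_seeds = select_seeds(pairs)
labels = classify_nearest_seed(pairs, match_seeds, non_match_seeds)
```

In this toy data the first four pairs become seeds and the last two are labelled by proximity to them. The SVM variation mentioned in the abstract would instead train a classifier on the seeds and repeatedly add further examples to the training sets; that loop is not shown here.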