A supervised learning and group linking method for historical census household linkage

Authors:
Zhichun Fu;Peter Christen;Mac Boot
Affiliations:
The Australian National University, Canberra, Australia;The Australian National University, Canberra, Australia;The Australian National University, Canberra, Australia
Venue:
AusDM '11 Proceedings of the Ninth Australasian Data Mining Conference - Volume 121
Year:
2011

Abstract

Historical census data provide a snapshot of the era in which our ancestors lived. Such data contain valuable information that allows the reconstruction of households and the tracking of family changes over time, supports the analysis of family diseases, and facilitates a variety of social science research. One topic of particular interest in historical census data analysis is the linking of households across time. Household linkage makes it possible to track the majority of members of a household over a period of time, which facilitates the extraction of information hidden in the data, such as fertility, occupations, changes in family structure, immigration, and movements. Such information normally cannot be easily acquired by linking only records that correspond to individuals. In this paper, we propose a novel method to link households in historical census data. In a first step, our method computes the attribute-wise similarity of individual record pairs; a support vector machine classifier is then trained on limited data and used to classify these record pairs into matches and non-matches. In a second step, a group linking approach is employed to link households based on the matched individual record pairs. Experimental results on real census data from the United Kingdom from 1851 to 1901 show that the proposed method can greatly reduce the number of multiple household matches compared with a traditional linkage of individual record pairs only.
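
The two-step idea in the abstract (classify individual record pairs, then aggregate the matches per household pair) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the record fields (`first`, `surname`, `age`), the similarity functions, the 10-year census gap, the Jaccard-style household score, and in particular the fixed threshold in `is_match`, which merely stands in for the trained support vector machine classifier.

```python
from difflib import SequenceMatcher


def similarity_vector(a, b, year_gap=10):
    """Attribute-wise similarities for one record pair (hypothetical fields)."""
    first = SequenceMatcher(None, a["first"], b["first"]).ratio()
    surname = SequenceMatcher(None, a["surname"], b["surname"]).ratio()
    # A person's age should advance by roughly the gap between two censuses.
    age = max(0.0, 1.0 - abs((b["age"] - a["age"]) - year_gap) / 5.0)
    return [first, surname, age]


def is_match(vec, threshold=0.8):
    """Stand-in for the trained SVM: a plain mean-similarity threshold."""
    return sum(vec) / len(vec) >= threshold


def household_score(recs_old, recs_new):
    """Assumed Jaccard-style group score: matched pairs / distinct members."""
    used, matched = set(), 0
    for a in recs_old:
        for j, b in enumerate(recs_new):
            # Greedy one-to-one assignment of records within the household pair.
            if j not in used and is_match(similarity_vector(a, b)):
                used.add(j)
                matched += 1
                break
    return matched / (len(recs_old) + len(recs_new) - matched)


def link_households(old, new):
    """Map each old household id to (best new household id, score)."""
    best = {}
    for h_old, recs_old in old.items():
        for h_new, recs_new in new.items():
            score = household_score(recs_old, recs_new)
            if score > 0 and (h_old not in best or score > best[h_old][1]):
                best[h_old] = (h_new, score)
    return best


# Toy data, invented for this example only.
census_1851 = {"A": [
    {"first": "John", "surname": "Smith", "age": 30},
    {"first": "Mary", "surname": "Smith", "age": 28},
    {"first": "Thomas", "surname": "Smith", "age": 2},
]}
census_1861 = {
    "X": [
        {"first": "John", "surname": "Smith", "age": 40},
        {"first": "Mary", "surname": "Smith", "age": 38},
        {"first": "Thomas", "surname": "Smith", "age": 12},
        {"first": "Ann", "surname": "Smith", "age": 5},
    ],
    "Y": [
        {"first": "John", "surname": "Smith", "age": 41},
        {"first": "Jane", "surname": "Smith", "age": 35},
    ],
}

best = link_households(census_1851, census_1861)
print(best)
```

In this toy data the record for John Smith alone matches a person in both household X and household Y, which is the kind of multiple match that individual-level linkage produces. Scoring at the household level resolves the ambiguity in favour of X, because three members of household A find counterparts there and only one does in Y.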