Probabilistic data generation for deduplication and data linkage

Authors: Peter Christen
Affiliations: Department of Computer Science, Australian National University, Canberra, Australia
Venue: IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning
Year: 2005

Abstract

In many data mining projects the data to be analysed contains personal information, such as names and addresses. Cleaning and pre-processing such data is likely to involve deduplication or linkage with other data, which is often made difficult by a lack of unique entity identifiers. In recent years research effort in data linkage and deduplication has increased, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publishing data that contains personal information is normally impossible because of privacy and confidentiality concerns. An alternative is to use artificially created data, which has two advantages: content and error rates can be controlled, and the deduplication or linkage status of every record is known. Controlled experiments can therefore be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
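
The idea of artificial data with controlled error rates and known linkage status can be illustrated with a minimal Python sketch. This is not the generator presented in the paper. The value lists, the record identifier scheme (`rec-N-org`, `rec-N-dup-M`), the function names and the single-character error model are all assumptions made for illustration.

```python
import random

# Hypothetical value lists; a real generator would sample from
# frequency tables of names and addresses instead.
GIVEN_NAMES = ["peter", "mary", "john", "sarah"]
SURNAMES = ["smith", "jones", "miller", "taylor"]
SUBURBS = ["canberra", "sydney", "melbourne", "perth"]
FIELDS = ("given_name", "surname", "suburb")
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def make_original(rec_num, rng):
    """Create one clean record; the entity number is encoded in rec_id."""
    return {
        "rec_id": f"rec-{rec_num}-org",
        "given_name": rng.choice(GIVEN_NAMES),
        "surname": rng.choice(SURNAMES),
        "suburb": rng.choice(SUBURBS),
    }


def corrupt(value, rng):
    """Apply one random single-character edit to a non-empty string.

    The result can occasionally equal the input, for example when a
    letter is substituted by itself.
    """
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    pos = rng.randrange(len(value))
    if op == "insert":
        return value[:pos] + rng.choice(LETTERS) + value[pos:]
    if op == "delete" and len(value) > 1:
        return value[:pos] + value[pos + 1:]
    if op == "transpose" and len(value) > 1:
        pos = min(pos, len(value) - 2)
        return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
    # Substitution, also used as the fallback for one-character values.
    return value[:pos] + rng.choice(LETTERS) + value[pos + 1:]


def generate(num_originals, num_duplicates, error_prob, seed=0):
    """Return shuffled original and duplicate records.

    error_prob is the probability that each field of a duplicate
    receives one edit, so the error rate is under the caller's control.
    """
    rng = random.Random(seed)
    originals = [make_original(i, rng) for i in range(num_originals)]
    records = list(originals)
    for d in range(num_duplicates):
        source = rng.choice(originals)
        dup = dict(source)
        base = source["rec_id"].rsplit("-", 1)[0]  # e.g. "rec-3"
        dup["rec_id"] = f"{base}-dup-{d}"
        for field in FIELDS:
            if rng.random() < error_prob:
                dup[field] = corrupt(dup[field], rng)
        records.append(dup)
    rng.shuffle(records)
    return records


def entity_of(rec_id):
    """Recover the true entity number, i.e. the known linkage status."""
    return int(rec_id.split("-")[1])
```

Calling `generate(1000, 200, 0.3, seed=42)` would, under these assumptions, produce 1,200 records in which any two records with the same `entity_of` value are true matches. Because the seed fixes the random choices, the same data set can be regenerated to replicate an experiment.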