Accurate Synthetic Generation of Realistic Personal Information

Authors:
Peter Christen;Agus Pudjijono
Affiliations:
School of Computer Science, The Australian National University, Canberra, Australia ACT 0200;Data Center, Ministry of Public Works of Republic of Indonesia, Jakarta, Indonesia 12110
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009

Citing 10

Techniques for automatically correcting words in text
ACM Computing Surveys (CSUR)

Approximate String Matching
ACM Computing Surveys (CSUR)

Automatic spelling correction in scientific and scholarly text
Communications of the ACM

A technique for computer detection and correction of spelling errors
Communications of the ACM

Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
Data Mining and Knowledge Discovery

Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering

A Comparison of Personal Name Matching: Techniques and Practical Issues
ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops

Privacy-Preserving Data Linkage and Geocoding: Current Approaches and Research Directions
ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops

Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Probabilistic data generation for deduplication and data linkage
IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning

Cited 12

A constraint satisfaction cryptanalysis of bloom filters in private record linkage
PETS'11 Proceedings of the 11th international conference on Privacy enhancing technologies

Duplicate detection through structure optimization
Proceedings of the 20th ACM international conference on Information and knowledge management

Quantifying the correctness, computational complexity, and security of privacy-preserving string comparators for record linkage
Information Fusion

A tool for generating synthetic authorship records for evaluating author name disambiguation methods
Information Sciences: an International Journal

De-duplication of aggregation authority files
International Journal of Metadata, Semantics and Ontologies

MFIBlocks: An effective blocking algorithm for entity resolution
Information Systems

A taxonomy of privacy-preserving record linkage techniques
Information Systems

An automatic blocking strategy for XML duplicate detection
ACM SIGAPP Applied Computing Review

Efficient two-party private blocking based on sorted nearest neighborhood clustering
Proceedings of the 22nd ACM international conference on information & knowledge management

Flexible and extensible generation and corruption of personal data
Proceedings of the 22nd ACM international conference on information & knowledge management

GeCo: an online personal data generator and corruptor
Proceedings of the 22nd ACM international conference on information & knowledge management

De-duplication of aggregation authority files
International Journal of Metadata, Semantics and Ontologies

Abstract

A large portion of the data collected by many organisations today is about people, and it often contains personally identifying information such as names and addresses. Privacy and confidentiality are of great concern when such data is shared between organisations or made publicly available. Research in (privacy-preserving) data mining and data linkage suffers from a lack of publicly available real-world data sets that contain personal information, which makes experimental evaluations difficult to conduct. To overcome this problem, we have developed a data generator that allows flexible creation of synthetic data containing personal information with realistic characteristics, such as frequency distributions, attribute dependencies, and error probabilities. Our generator significantly improves on earlier approaches, and allows the generation of data for individuals, families and households.
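
The three characteristics named in the abstract (frequency distributions, attribute dependencies, and error probabilities) can be illustrated with a minimal Python sketch. This is not the authors' generator: the lookup tables, the record identifier format, the function names, and the single-character corruption model are all hypothetical choices made for illustration only.

```python
import random

# Hypothetical frequency tables (value -> count). A real generator would
# load such counts from external look-up files.
GIVEN_NAMES = {
    "f": {"emma": 50, "olivia": 30, "mia": 20},
    "m": {"jack": 45, "liam": 35, "noah": 20},
}
SURNAMES = {"smith": 60, "jones": 25, "nguyen": 15}
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def weighted_choice(rng, freq):
    """Pick a key of `freq` with probability proportional to its count."""
    values = list(freq)
    return rng.choices(values, weights=[freq[v] for v in values], k=1)[0]


def make_record(rng, rec_id):
    """Create one clean record; the given name depends on the gender."""
    gender = rng.choice(["f", "m"])
    return {
        "rec_id": f"rec-{rec_id}-org",  # illustrative identifier format
        "gender": gender,
        "given_name": weighted_choice(rng, GIVEN_NAMES[gender]),
        "surname": weighted_choice(rng, SURNAMES),
    }


def corrupt_value(rng, value):
    """Apply a single random character edit to a non-empty string.

    The result can equal the input, e.g. when a doubled letter is transposed.
    """
    pos = rng.randrange(len(value))
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    if op == "insert":
        return value[:pos] + rng.choice(LETTERS) + value[pos:]
    if op == "delete" and len(value) > 1:
        return value[:pos] + value[pos + 1:]
    if op == "transpose" and pos + 1 < len(value):
        return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
    # Substitute (also the fallback when delete or transpose is not possible).
    letter = rng.choice([c for c in LETTERS if c != value[pos]])
    return value[:pos] + letter + value[pos + 1:]


def make_duplicate(rng, record, dup_num, error_prob):
    """Copy a record and corrupt each name field with probability `error_prob`."""
    dup = dict(record)
    dup["rec_id"] = record["rec_id"].replace("-org", f"-dup-{dup_num}")
    for field in ("given_name", "surname"):
        if rng.random() < error_prob:
            dup[field] = corrupt_value(rng, dup[field])
    return dup


rng = random.Random(42)  # fixed seed so that runs are reproducible
originals = [make_record(rng, i) for i in range(5)]
duplicates = [make_duplicate(rng, rec, 0, error_prob=0.5) for rec in originals]
```

Because each duplicate keeps the numeric part of its original's identifier, the true match status of any record pair is known, which is what makes such synthetic data usable for evaluating linkage and deduplication methods. Generating families and households would additionally require dependencies between records (for example shared surnames and addresses), which this sketch does not attempt.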