Random Forests for Generating Partially Synthetic, Categorical Data

Authors:
Gregory Caiola;Jerome P. Reiter
Affiliations:
Department of Statistical Science, Duke University, Durham, NC 27708, USA. e-mail: gregory.caiola@duke.edu;Department of Statistical Science, Duke University, Durham, NC 27708, USA. e-mail: jerry@stat.duke.edu
Venue:
Transactions on Data Privacy
Year:
2010



Abstract

Several national statistical agencies are now releasing partially synthetic, public use microdata. These comprise the units in the original database with sensitive or identifying values replaced with values simulated from statistical models. Specifying synthesis models can be daunting in databases that include many variables of diverse types. These variables may be related in ways that can be difficult to capture with standard parametric tools. In this article, we describe how random forests can be adapted to generate partially synthetic data for categorical variables. Using an empirical study, we illustrate that the random forest synthesizer can preserve relationships reasonably well while providing low disclosure risks. The random forest synthesizer has some appealing features for statistical agencies: it can be applied with minimal tuning, easily incorporates numerical, categorical, and mixed variables as predictors, operates efficiently in high dimensions, and automatically fits non-linear relationships.
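
The abstract does not spell out the synthesis procedure, so the following is only a minimal sketch of one plausible reading: fit a forest that predicts the sensitive categorical variable from the other variables, then replace each record's value with a draw from the distribution of the individual trees' votes for that record. Everything here is an assumption made for illustration: the toy records, the variable names (`age`, `region`, `occupation`), the helper functions, and in particular the use of one-split trees (stumps) on a single randomly chosen predictor in place of fully grown trees. It uses only the Python standard library.

```python
import random
from collections import Counter, defaultdict

def fit_stump(rows, target, predictors, rng):
    """Fit a one-split tree on a bootstrap sample, using one random predictor."""
    sample = [rng.choice(rows) for _ in rows]      # bootstrap resample of the records
    feature = rng.choice(predictors)               # random feature selection
    by_level = defaultdict(Counter)
    for r in sample:
        by_level[r[feature]][r[target]] += 1
    # Fallback label for predictor levels missing from the bootstrap sample.
    overall = Counter(r[target] for r in sample).most_common(1)[0][0]
    # Each leaf predicts the most frequent target category at that level.
    leaves = {lvl: c.most_common(1)[0][0] for lvl, c in by_level.items()}
    return feature, leaves, overall

def predict_stump(stump, row):
    feature, leaves, overall = stump
    return leaves.get(row[feature], overall)

def synthesize(rows, target, predictors, n_trees=200, seed=0):
    """Return copies of rows with `target` redrawn from the forest's vote shares."""
    rng = random.Random(seed)
    forest = [fit_stump(rows, target, predictors, rng) for _ in range(n_trees)]
    synthetic = []
    for r in rows:
        votes = Counter(predict_stump(s, r) for s in forest)
        labels = sorted(votes)
        weights = [votes[label] for label in labels]
        # Sample from the vote distribution rather than taking the majority vote,
        # so the released value is a random draw, not a deterministic prediction.
        new_value = rng.choices(labels, weights=weights, k=1)[0]
        synthetic.append({**r, target: new_value})  # predictors are left unchanged
    return synthetic

# Hypothetical toy microdata; "occupation" plays the role of the sensitive variable.
records = [
    {"age": "young", "region": "north", "occupation": "student"},
    {"age": "young", "region": "south", "occupation": "student"},
    {"age": "young", "region": "north", "occupation": "clerk"},
    {"age": "middle", "region": "north", "occupation": "clerk"},
    {"age": "middle", "region": "south", "occupation": "manager"},
    {"age": "middle", "region": "south", "occupation": "clerk"},
    {"age": "old", "region": "north", "occupation": "manager"},
    {"age": "old", "region": "south", "occupation": "retired"},
    {"age": "old", "region": "north", "occupation": "retired"},
]

synthetic = synthesize(records, "occupation", ["age", "region"], seed=0)
for original, released in zip(records, synthetic):
    print(original["occupation"], "->", released["occupation"])
```

Only the target column changes, which matches the "partially synthetic" idea of keeping the original units and replacing just the sensitive values. A realistic synthesizer would grow deeper trees and would need separate checks of analytic validity and disclosure risk, neither of which this sketch attempts.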