Several national statistical agencies are now releasing partially synthetic, public use microdata. These comprise the units in the original database with sensitive or identifying values replaced with values simulated from statistical models. Specifying synthesis models can be daunting in databases that include many variables of diverse types. These variables may be related in ways that can be difficult to capture with standard parametric tools. In this article, we describe how random forests can be adapted to generate partially synthetic data for categorical variables. Using an empirical study, we illustrate that the random forest synthesizer can preserve relationships reasonably well while providing low disclosure risks. The random forest synthesizer has some appealing features for statistical agencies: it can be applied with minimal tuning, easily incorporates numerical, categorical, and mixed variables as predictors, operates efficiently in high dimensions, and automatically fits non-linear relationships.
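
The abstract does not spell out the synthesis algorithm, so the following is only a minimal sketch of one way a random forest could replace a sensitive categorical variable. It assumes that each record's replacement value is drawn at random in proportion to the votes cast by the trees for that record. The function names (grow_tree, fit_forest, synthesize), the small from-scratch forest with multiway splits on categorical predictors, and the toy records (edu, region, income) are all hypothetical and chosen for illustration; they are not taken from the article.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, target, predictors, mtry, rng, min_size=5, depth=0, max_depth=8):
    """Grow one classification tree using multiway splits on categorical predictors."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(rows) < min_size or len(set(labels)) == 1:
        return {"leaf": majority}
    best = None
    # Consider only a random subset of predictors at each node (the "mtry" idea).
    for var in rng.sample(predictors, min(mtry, len(predictors))):
        groups = {}
        for r in rows:
            groups.setdefault(r[var], []).append(r)
        if len(groups) < 2:
            continue  # this predictor cannot split the node
        score = sum(len(g) * gini([r[target] for r in g]) for g in groups.values()) / len(rows)
        if best is None or score < best[0]:
            best = (score, var, groups)
    if best is None:
        return {"leaf": majority}
    _, var, groups = best
    children = {v: grow_tree(g, target, predictors, mtry, rng, min_size, depth + 1, max_depth)
                for v, g in groups.items()}
    return {"var": var, "children": children, "leaf": majority}

def tree_predict(tree, row):
    """Run a record down the tree and return the majority class where it stops."""
    while "var" in tree:
        child = tree["children"].get(row[tree["var"]])
        if child is None:  # category not seen at this node: use the node's majority
            break
        tree = child
    return tree["leaf"]

def fit_forest(rows, target, predictors, n_trees=50, mtry=2, seed=0):
    """Fit each tree to a bootstrap sample of the original records."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(rows) for _ in rows]
        forest.append(grow_tree(boot, target, predictors, mtry, rng))
    return forest

def synthesize(rows, target, forest, seed=1):
    """Replace the target value of every record with a draw from the tree votes."""
    rng = random.Random(seed)
    out = []
    for r in rows:
        votes = Counter(tree_predict(t, r) for t in forest)
        classes = sorted(votes)
        value = rng.choices(classes, weights=[votes[c] for c in classes])[0]
        out.append({**r, target: value})  # copy, so the original record is untouched
    return out

# Toy confidential data (made up): income depends on education, not on region.
demo_rng = random.Random(42)
records = []
for _ in range(300):
    edu = demo_rng.choice(["low", "mid", "high"])
    region = demo_rng.choice(["north", "south"])
    p_high = {"low": 0.1, "mid": 0.5, "high": 0.9}[edu]
    income = "high" if demo_rng.random() < p_high else "low"
    records.append({"edu": edu, "region": region, "income": income})

forest = fit_forest(records, "income", ["edu", "region"])
synthetic = synthesize(records, "income", forest)
```

In this sketch only the income field is replaced, while edu and region are released unchanged, which is what makes the output partially rather than fully synthetic. Drawing from the vote proportions instead of releasing the single most-voted class is a design choice made here so that the released values are simulated rather than deterministic predictions; a production synthesizer would also need the disclosure risk and utility checks that the abstract refers to.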