Distribution-preserving statistical disclosure limitation

Authors:
Simon D. Woodcock;Gary Benedetto
Affiliations:
Simon Fraser University, Canada;US Census Bureau, United States
Venue:
Computational Statistics & Data Analysis
Year:
2009

Abstract

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences based on the partially synthetic data, because the imputation model determines the distribution of synthetic values. We present a practical method to generate synthetic values when the imputer has only limited information about the true data generating process. We combine a simple imputation model (such as regression) with density-based transformations that preserve the distribution of the confidential data, up to sampling error, on specified subdomains. We demonstrate through simulations and a large-scale application that our approach preserves important statistical properties of the confidential data, including higher moments, with low disclosure risk.
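
The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: an empirical quantile mapping stands in for the density-based transformation, whose exact form the abstract does not give. The function name `synthesize`, the simulated data, and the two-group subdomain variable are all hypothetical, and the sketch makes no attempt to assess disclosure risk.

```python
import numpy as np

def synthesize(x, y, groups, rng):
    """Regression imputation followed by a per-subdomain quantile mapping.

    Illustrative only: the quantile mapping is an assumed stand-in for the
    density-based transformation described in the abstract.
    """
    # Step 1: simple imputation model, here OLS of y on x with an intercept.
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma = resid.std(ddof=2)  # two estimated coefficients
    draw = X @ beta + rng.normal(0.0, sigma, size=len(y))

    # Step 2: within each subdomain, replace each draw with the quantile of
    # the confidential values at that draw's within-subdomain rank.
    synth = np.empty(len(y), dtype=float)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        ranks = draw[idx].argsort().argsort()   # ranks 0 .. n_g - 1
        probs = (ranks + 0.5) / idx.size        # mid-rank probabilities
        synth[idx] = np.quantile(y[idx], probs)
    return synth

# Simulated example: a right-skewed confidential variable, which a normal
# linear regression alone would reproduce poorly.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
groups = rng.integers(0, 2, size=n)             # two hypothetical subdomains
y = np.exp(0.5 * x + rng.normal(scale=0.5, size=n))
synth = synthesize(x, y, groups, rng)
```

In this sketch the regression step determines only the ordering of the synthetic values within each subdomain, while the mapping step determines their marginal distribution. That is why the output stays strictly positive and skewed even though the raw regression draws can be negative.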