Maximum entropy simulation for microdata protection

Authors: Silvia Polettini
Affiliations: ISTAT, Servizio della Metodologia di Base per la Produzione Statistica, Via Cesare Balbo 16, 00184 Roma, Italy. polettin@istat.it
Venue: Statistics and Computing
Year: 2003

Abstract

The paper proposes a new disclosure limitation procedure based on simulation. The key feature of the proposal is to protect actual microdata by drawing artificial units from a probability model that is estimated from the observed data. Such a model is designed to maintain selected characteristics of the empirical distribution, thus providing a partial representation of the latter. The characteristics we focus on are the expected values of a set of functions; these are constrained to be equal to their corresponding sample averages; the simulated data, then, reproduce on average the sample characteristics. If the set of constraints covers the parameters of interest of a user, information loss is controlled for, while, as the model does not preserve individual values, re-identification attempts are impaired: synthetic individuals correspond to actual respondents with very low probability.

Disclosure is mainly discussed from the viewpoint of record re-identification. According to this definition, as the pledge for confidentiality only involves the actual respondents, release of synthetic units should in principle rule out the concern for confidentiality.

The simulation model is built on the Italian sample from the Community Innovation Survey (CIS). The approach can be applied in more generality, and especially suits quantitative traits. The model has a semi-parametric component, based on the maximum entropy principle, and, here, a parametric component, based on regression. The maximum entropy principle is exploited to match data traits; moreover, entropy measures uncertainty of a distribution: its maximisation leads to a distribution which is consistent with the given information but is maximally noncommittal with regard to missing information.

Application results reveal that the fixed characteristics are sustained, and other features such as marginal distributions are well represented. Model specification is clearly a major point; related issues are selection of characteristics, goodness of fit and strength of dependence relations.
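
The maximum entropy step described in the abstract can be illustrated with a small sketch. It is not the paper's model: it assumes a single variable restricted to a finite grid, invented confidential values, and two constraint functions (x and x^2), and it omits the regression component entirely. Under expectation constraints the maximum entropy distribution has the form p(x) proportional to exp(sum_k lambda_k f_k(x)); the sketch finds the multipliers by a damped Newton iteration on the convex dual and then draws synthetic units from the fitted distribution. All names (`fit_maxent`, `probabilities`, `solve_linear`, `data`, `grid`) are hypothetical.

```python
import math
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def probabilities(lam, F):
    """Return (p, log Z) with p_i proportional to exp(sum_k lam_k * F[i][k])."""
    scores = [sum(l * f for l, f in zip(lam, row)) for row in F]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights], top + math.log(z)


def dual_value(lam, F, targets):
    """Convex dual: log Z(lam) - lam . targets (minimised at the maxent fit)."""
    return probabilities(lam, F)[1] - sum(l * c for l, c in zip(lam, targets))


def fit_maxent(grid, features, targets, max_iter=100, tol=1e-9):
    """Find multipliers so that E_p[f_k] equals targets[k] on the grid."""
    K = len(features)
    F = [[f(x) for f in features] for x in grid]
    lam = [0.0] * K  # start from the uniform distribution
    for _ in range(max_iter):
        p, _ = probabilities(lam, F)
        mean = [sum(pi * row[k] for pi, row in zip(p, F)) for k in range(K)]
        grad = [mean[k] - targets[k] for k in range(K)]
        if max(abs(g) for g in grad) < tol:
            break
        # The Hessian of the dual is the covariance of the features under p.
        cov = [[sum(pi * (row[a] - mean[a]) * (row[b] - mean[b])
                    for pi, row in zip(p, F)) for b in range(K)]
               for a in range(K)]
        step = solve_linear(cov, grad)
        current = dual_value(lam, F, targets)
        size = 1.0
        while size > 1e-8:  # halve the Newton step until the dual stops rising
            trial = [l - size * s for l, s in zip(lam, step)]
            if dual_value(trial, F, targets) <= current + 1e-12:
                break
            size /= 2
        lam = trial
    return lam, probabilities(lam, F)[0]


# Invented "confidential" values, used only for illustration.
data = [0.2, 1.5, 0.7, 2.9, 1.1, 0.4, 3.6, 0.9, 1.8, 0.6]
grid = [i / 20 for i in range(101)]  # assumed support: 0.00, 0.05, ..., 5.00
features = [lambda x: x, lambda x: x * x]
# Constraints: expected values must equal the sample averages.
targets = [sum(f(x) for x in data) / len(data) for f in features]

lam, p = fit_maxent(grid, features, targets)

# Synthetic units are drawn from the fitted model, not copied from the data.
synthetic = random.Random(0).choices(grid, weights=p, k=1000)
```

Any finite synthetic sample matches the constrained averages only approximately; it is the fitted distribution itself that satisfies them, which is the sense in which simulated data reproduce the sample characteristics on average.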