Generating Sufficiency-based Non-Synthetic Perturbed Data

Authors:
Krishnamurty Muralidhar;Rathindra Sarathy
Affiliations:
School of Management, Gatton College of Business and Economics, University of Kentucky, Lexington KY 40506, USA. E-mail: krishm@uky.edu; Department of Management Science & Information Systems, Spears School of Business, Oklahoma State University, Stillwater OK 74073, USA. E-mail: Sarathy@okstate.edu
Venue:
Transactions on Data Privacy
Year:
2008

Abstract

The mean vector and covariance matrix are sufficient statistics when the underlying distribution is multivariate normal. Many types of statistical analyses used in practice rely on the assumption of multivariate normality (Gaussian model). For these analyses, if the mean vector and covariance matrix of the masked data are the same as those of the original data, then analyzing the masked data with these techniques yields the same results as analyzing the original data. For numerical confidential data, a recently proposed perturbation method makes it possible to keep the mean vector and covariance matrix of the masked data exactly the same as those of the original data. However, as currently proposed, the perturbed values from this method are considered synthetic because they are generated without considering the values of the confidential variables (they are based only on the non-confidential variables). Some researchers argue that synthetic data results in information loss. In this study, we provide a new methodology for generating non-synthetic perturbed data that keeps the mean vector and covariance matrix of the masked data exactly the same as those of the original data while offering a selectable degree of similarity between the original and perturbed data.
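
The idea of exact moment preservation with a tunable similarity can be illustrated with a minimal sketch. This is not the authors' procedure: it is a simplified construction that assumes there are no non-confidential variables, and the function name `perturb_preserving_moments` and the parameter `alpha` are hypothetical names chosen for illustration. The sketch mixes the centered original data with a noise term that has been forced to have zero sample mean, zero sample cross-covariance with the original data, and the same sample covariance as the original data. Under those assumptions the masked data has exactly the same sample mean vector and covariance matrix as the original, and `alpha` controls how closely the masked values track the original values.

```python
import numpy as np

def perturb_preserving_moments(X, alpha, seed=0):
    """Illustrative sketch (not the paper's algorithm).

    Returns Y with the same sample mean vector and sample covariance
    matrix as X. alpha in [0, 1] sets the similarity: alpha = 1 returns
    X itself, alpha = 0 returns values unrelated to the individual rows.
    Assumes X has full column rank after centering and n > 2p + 1 rows.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    mean = X.mean(axis=0)
    Xc = X - mean                                  # centered original data
    cov_x = np.atleast_2d(np.cov(X, rowvar=False))

    # Random noise, centered so that it does not shift the mean.
    E = rng.standard_normal((n, p))
    E -= E.mean(axis=0)

    # Remove the part of the noise that lies in the column space of Xc,
    # so the sample cross-covariance between noise and data is zero.
    coef, *_ = np.linalg.lstsq(Xc, E, rcond=None)
    E -= Xc @ coef

    # Rescale the noise so its sample covariance equals cov_x exactly:
    # whiten with the noise's own Cholesky factor, then color with cov_x's.
    L_e = np.linalg.cholesky(np.atleast_2d(np.cov(E, rowvar=False)))
    L_x = np.linalg.cholesky(cov_x)
    E = E @ np.linalg.inv(L_e).T @ L_x.T

    # cov(Y) = alpha^2 * cov_x + (1 - alpha^2) * cov_x = cov_x
    return mean + alpha * Xc + np.sqrt(1.0 - alpha**2) * E
```

Because the cross term vanishes by construction, the covariance identity in the final comment holds for the sample moments and not only in expectation. A value of `alpha` near 1 keeps the masked values close to the originals (lower protection), while a value near 0 makes them nearly independent of the individual original values.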