Generating Sufficiency-based Non-Synthetic Perturbed Data

  • Authors:
  • Krishnamurty Muralidhar;Rathindra Sarathy

  • Affiliations:
  • School of Management, Gatton College of Business and Economics, University of Kentucky, Lexington KY 40506 USA. e-mail: krishm@uky.edu; Department of Management Science & Information Systems, Spears School of Business, Oklahoma State University, Stillwater OK 74073 USA. e-mail: Sarathy@okstate.edu

  • Venue:
  • Transactions on Data Privacy
  • Year:
  • 2008

Abstract

The mean vector and covariance matrix are sufficient statistics when the underlying distribution is multivariate normal. Many types of statistical analyses used in practice rely on the assumption of multivariate normality (Gaussian model). For these analyses, keeping the mean vector and covariance matrix of the masked data the same as those of the original data implies that analyzing the masked data with these techniques yields the same results as analyzing the original data. For numerical confidential data, a recently proposed perturbation method makes it possible to keep the mean vector and covariance matrix of the masked data exactly the same as those of the original data. However, as currently proposed, the perturbed values from this method are considered synthetic because they are generated without considering the values of the confidential variables (they are based only on the non-confidential variables). Some researchers argue that synthetic data results in information loss. In this study, we provide a new methodology for generating non-synthetic perturbed data that keeps the mean vector and covariance matrix of the masked data exactly the same as those of the original data while offering a selectable degree of similarity between original and perturbed data.
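
The idea of preserving the mean vector and covariance matrix exactly while choosing how closely the perturbed values track the originals can be illustrated with a minimal sketch. This is not necessarily the authors' procedure: it is a simplified construction that assumes there are no non-confidential variables, and the function name `perturb_preserving_moments` and the parameter `alpha` are hypothetical names introduced here. The sketch mixes the centered original data with a noise term that is forced to have exactly the same sample covariance as the original and exactly zero sample cross-covariance with it, so the moments of the result match those of the input.

```python
import numpy as np


def perturb_preserving_moments(X, alpha, rng):
    """Illustrative sketch (assumed construction, not the paper's exact method).

    X     : (n, k) array of confidential numeric data, with n well above 2k.
    alpha : similarity level in [0, 1]; 1 returns X, 0 ignores X's values.
    rng   : a numpy.random.Generator used to draw the noise.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    mean = X.mean(axis=0)
    Xc = X - mean                               # centered original data
    S = np.atleast_2d(np.cov(X, rowvar=False))  # sample covariance of X

    # Draw noise, center it, and remove its projection onto the columns of Xc
    # so that its sample cross-covariance with X is exactly zero.
    Z = rng.standard_normal(X.shape)
    Z -= Z.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Z, rcond=None)
    Z -= Xc @ coef

    # Whiten the noise (sample covariance becomes the identity), then
    # recolor it so that its sample covariance is exactly S.
    Lz = np.linalg.cholesky(np.atleast_2d(np.cov(Z, rowvar=False)))
    W = Z @ np.linalg.inv(Lz).T
    E = W @ np.linalg.cholesky(S).T

    # cov(Y) = alpha^2 * S + (1 - alpha^2) * S = S, and mean(Y) = mean(X).
    return mean + alpha * Xc + np.sqrt(1.0 - alpha**2) * E
```

Under these assumptions, the sample correlation between each original variable and its perturbed counterpart equals `alpha`, which is one way to read the "selectable degree of similarity": values near 1 keep the perturbed data close to the original, while values near 0 make it behave like synthetic data.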