Spectral anonymization of data

  • Authors:
  • Peter Szolovits; Staal A. Vinterbo; Thomas Anton Lasko

  • Affiliations:
  • Massachusetts Institute of Technology; Massachusetts Institute of Technology; Massachusetts Institute of Technology

  • Venue:
  • Doctoral dissertation, Massachusetts Institute of Technology
  • Year:
  • 2007


Abstract

Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, but valid scientific analysis can nevertheless be performed on it. It is not sufficient to simply remove identifying information, because the remaining data may be enough to infer the individual source of a record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset.

Dozens of anonymization methods have been proposed over the past few decades, most of them perturbing or suppressing variable values in the original dataset. None has been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as that provided by the data's eigenvectors.

Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection.

I give three examples of spectral anonymization in practice. The first improves basic cell swapping from a weak algorithm to one competitive with state-of-the-art swapping merely by a change of basis. The second demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants while preserving at least fourth-order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.

(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
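The intuition behind the change of basis can be illustrated with a short sketch. The code below is not the dissertation's algorithm; it is a minimal illustration, assuming a numeric data matrix X with records as rows, and all function and variable names are hypothetical. It performs an extreme form of cell swapping (a full random permutation of each coordinate) in the spectral basis given by the data's singular vectors. Because the spectral coordinates are mutually uncorrelated by construction, the permutation approximately preserves the covariance structure of the data while destroying the linkage between records and their original values.

    import numpy as np

    def spectral_swap(X, rng=None):
        """Illustrative sketch: permute records within each spectral coordinate.

        Rotates the centered data into its spectral basis, independently
        permutes each coordinate, and rotates back. Not the dissertation's
        algorithm; a hypothetical minimal example of swapping in a
        spectral basis rather than the original one.
        """
        rng = np.random.default_rng(rng)
        mu = X.mean(axis=0)
        # Spectral basis: singular vectors of the centered data matrix.
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        S = U * s                       # records in spectral coordinates
        for j in range(S.shape[1]):     # swap within each coordinate
            S[:, j] = rng.permutation(S[:, j])
        return S @ Vt + mu              # rotate back to the original basis

Under these assumptions, np.cov(spectral_swap(X).T) stays close to np.cov(X.T), whereas the same per-column permutation applied directly to X would destroy all inter-variable correlations. This contrast is the motivation for operating in a spectral basis.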