Workload-aware anonymization techniques for large-scale datasets

  • Authors: Kristen LeFevre; David J. DeWitt; Raghu Ramakrishnan

  • Affiliations: University of Michigan; Microsoft; Yahoo! Research

  • Venue: ACM Transactions on Database Systems (TODS)
  • Year: 2008

Abstract

Protecting individual privacy is an important problem in microdata distribution and publishing. Anonymization algorithms typically aim to satisfy certain privacy definitions with minimal impact on the quality of the resulting data. While much of the previous literature has measured quality through simple one-size-fits-all measures, we argue that quality is best judged with respect to the workload for which the data will ultimately be used. This article provides a suite of anonymization algorithms that incorporate a target class of workloads, consisting of one or more data mining tasks as well as selection predicates. An extensive empirical evaluation indicates that this approach is often more effective than previous techniques. In addition, we consider the problem of scalability. The article describes two extensions that allow us to scale the anonymization algorithms to datasets much larger than main memory. The first extension is based on ideas from scalable decision trees, and the second is based on sampling. A thorough performance evaluation indicates that these techniques are viable in practice.