Workload-aware anonymization techniques for large-scale datasets

Authors:
Kristen LeFevre; David J. DeWitt; Raghu Ramakrishnan
Affiliations:
University of Michigan; Microsoft; Yahoo! Research
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2008

Cites 39 works (first list below)
Cited by 14 works (second list below)

Security-control methods for statistical databases: a comparative study

ACM Computing Surveys (CSUR)
C4.5: programs for machine learning

C4.5: programs for machine learning
BOAT—optimistic decision tree construction

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Privacy-preserving data mining

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Protecting Respondents' Identities in Microdata Release

IEEE Transactions on Knowledge and Data Engineering
Practical Data-Oriented Microaggregation for Statistical Disclosure Control

IEEE Transactions on Knowledge and Data Engineering
Database Mining: A Performance Perspective

IEEE Transactions on Knowledge and Data Engineering
RainForest - A Framework for Fast Decision Tree Construction of Large Datasets

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
k-anonymity: a model for protecting privacy

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
Achieving k-anonymity privacy protection using generalization and suppression

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
Privacy preserving mining of association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Transforming data to satisfy privacy constraints

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Bottom-Up Generalization: A Data Mining Solution to Privacy Protection

ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Top-Down Specialization for Information and Privacy Preservation

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Data Privacy through Optimal k-Anonymization

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
On the complexity of optimal K-anonymity

PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Simulatable auditing

Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Practical privacy: the SuLQ framework

Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Incognito: efficient full-domain K-anonymity

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Checking for k-anonymity violation by views

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Prediction cubes

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Mondrian Multidimensional K-Anonymity

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
ℓ-Diversity: Privacy Beyond k-Anonymity

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Privacy via pseudorandom sketches

Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Achieving anonymity via clustering

Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Injecting utility into anonymized datasets

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Personalized privacy preservation

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Workload-aware anonymization

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Anonymizing sequential releases

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
The new Casper: query processing for location services without compromising privacy

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
M-invariance: towards privacy preserving re-publication of dynamic datasets

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Maintaining data privacy in association rule mining

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
K-anonymization as spatial indexing: toward scalable and incremental anonymization

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Privacy skyline: privacy with multidimensional adversarial knowledge

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Differential privacy

ICALP'06 Proceedings of the 33rd international conference on Automata, Languages and Programming - Volume Part II
Anonymizing tables

ICDT'05 Proceedings of the 10th international conference on Database Theory
Toward privacy in public databases

TCC'05 Proceedings of the Second international conference on Theory of Cryptography
Calibrating noise to sensitivity in private data analysis

TCC'06 Proceedings of the Third conference on Theory of Cryptography

Anonymization-based attacks in privacy-preserving data publishing

ACM Transactions on Database Systems (TODS)
Anonymizing healthcare data: a case study on the blood transfusion service

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Privacy-Preserving Data Publishing

Foundations and Trends in Databases
Facilitating discovery on the private web using dataset digests

International Journal of Metadata, Semantics and Ontologies
Centralized and Distributed Anonymization for High-Dimensional Healthcare Data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Instant anonymization

ACM Transactions on Database Systems (TODS)
Differentially private data release for data mining

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Information based data anonymization for classification utility

Data & Knowledge Engineering
Privacy and utility for defect prediction: experiments with MORPH

Proceedings of the 34th International Conference on Software Engineering
Anonymizing set-valued data by nonreciprocal recoding

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Privacy consensus in anonymization systems via game theory

DBSec'12 Proceedings of the 26th Annual IFIP WG 11.3 conference on Data and Applications Security and Privacy
Anonymizing classification data using rough set theory

Knowledge-Based Systems
A new tool for sharing and querying of clinical documents modeled using HL7 Version 3 standard

Computer Methods and Programs in Biomedicine
A general framework for privacy preserving data publishing

Knowledge-Based Systems

Abstract

Protecting individual privacy is an important problem in microdata distribution and publishing. Anonymization algorithms typically aim to satisfy certain privacy definitions with minimal impact on the quality of the resulting data. While much of the previous literature has measured quality through simple one-size-fits-all measures, we argue that quality is best judged with respect to the workload for which the data will ultimately be used. This article provides a suite of anonymization algorithms that incorporate a target class of workloads, consisting of one or more data mining tasks as well as selection predicates. An extensive empirical evaluation indicates that this approach is often more effective than previous techniques. In addition, we consider the problem of scalability. The article describes two extensions that allow us to scale the anonymization algorithms to datasets much larger than main memory. The first extension is based on ideas from scalable decision trees, and the second is based on sampling. A thorough performance evaluation indicates that these techniques are viable in practice.
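
To make the idea of judging quality against a workload concrete, the following is a minimal sketch and not the article's algorithms. It assumes numeric quasi-identifiers, a single classification task as the workload, and k-anonymity as the privacy definition. It greedily splits the table on whichever attribute and threshold give the lowest weighted entropy of the class label, and rejects any split that would leave fewer than k records on either side. All names here (entropy, best_split, partition, generalize) and the toy table are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(rows, k):
    """Return (score, left, right) for the split with the lowest weighted
    label entropy among splits that keep at least k rows on each side,
    or None when no such split exists."""
    best = None
    for attr in range(len(rows[0][0])):
        for t in sorted({qi[attr] for qi, _ in rows}):
            left = [r for r in rows if r[0][attr] <= t]
            right = [r for r in rows if r[0][attr] > t]
            if len(left) < k or len(right) < k:
                continue  # this split would violate k-anonymity
            score = sum(len(p) * entropy([y for _, y in p])
                        for p in (left, right))
            if best is None or score < best[0]:
                best = (score, left, right)
    return best


def partition(rows, k):
    """Recursively split rows; every returned group has at least k rows."""
    split = best_split(rows, k)
    if split is None:
        return [rows]
    _, left, right = split
    return partition(left, k) + partition(right, k)


def generalize(group):
    """Replace each quasi-identifier value by the group's (min, max) range."""
    dims = range(len(group[0][0]))
    box = tuple((min(qi[d] for qi, _ in group),
                 max(qi[d] for qi, _ in group)) for d in dims)
    return [(box, y) for _, y in group]


# Hypothetical toy table: ((age, zip), class label)
rows = [((25, 53703), "low"), ((27, 53706), "low"), ((31, 53703), "low"),
        ((44, 53711), "high"), ((47, 53706), "high"), ((52, 53711), "high")]

groups = partition(rows, k=3)
release = [rec for g in groups for rec in generalize(g)]
```

On this toy table the label-aware split separates the "low" records from the "high" ones, so the generalized release keeps the attribute ranges that matter for the classification task. A workload-agnostic criterion, such as always splitting the widest attribute at its median, offers no such guarantee.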