Publishing naive Bayesian classifiers: privacy without accuracy loss

  • Authors:
  • Barzan Mozafari; Carlo Zaniolo

  • Affiliations:
  • University of California, Los Angeles; University of California, Los Angeles

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2009

Abstract

We address the problem of publishing a Naïve Bayesian Classifier (NBC) or, equivalently, publishing the necessary views for building an NBC, while protecting privacy of the individuals who provided the training data. Our approach completely preserves the accuracy of the original classifier, and thus significantly improves on current approaches, such as randomization or anonymization, which typically degrade accuracy to preserve privacy. Current query-view security checkers address the question of 'Is the view safe to publish?' and are computationally expensive (often Π^p_2-complete). Here instead, we tackle the question of 'How to make a view safe to publish?' and propose a linear-time algorithm to publish safe NBC-enabling views. We first show that a simple measure that restricts the ratios between the published NBC statistics is sufficient to prevent any breach of privacy. Then, we propose a linear-time algorithm to enforce this measure by producing perturbed statistics that assure both (i) individuals' privacy, and (ii) a classifier that behaves in the same way as the NBC trained on the original data. By carefully expressing the derived statistics using rational numbers, we can easily produce synthetic (sanitized) datasets. Thus, for any given dataset, we produce another dataset that is secure to publish (w.r.t. a uniform prior) and achieves the same classification accuracy. Finally, we extend our results by providing sufficient conditions to cope with arbitrary (non-uniform prior) distributions, and we validate their effectiveness in practice through experiments on real-world data.
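The sketch below is only an illustration of the underlying observation, not the paper's algorithm: an NBC's predictions depend on the argmax over products of per-class statistics, so rescaling each attribute's conditional counts by a factor that is shared across all classes leaves every classification decision unchanged. All function names and data layouts here are hypothetical.

```python
# Minimal, hypothetical sketch (not the paper's actual algorithm): perturbing
# NBC statistics with a per-attribute scale factor shared across classes
# preserves the argmax, and hence the classifier's behavior.
from collections import defaultdict

def train_nbc(records, label_key):
    """Collect the raw counts an NBC needs: class counts and per-class,
    per-attribute-value counts."""
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)   # (class, attribute, value) -> count
    for rec in records:
        c = rec[label_key]
        class_counts[c] += 1
        for attr, val in rec.items():
            if attr != label_key:
                cond_counts[(c, attr, val)] += 1
    return dict(class_counts), dict(cond_counts)

def classify(class_counts, cond_counts, instance):
    """Pick the class maximizing P(c) * prod_i P(a_i | c), estimated from the
    published statistics."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total
        for attr, val in instance.items():
            score *= cond_counts.get((c, attr, val), 0) / n_c
        if score > best_score:
            best, best_score = c, score
    return best

def publish_scaled(cond_counts, scale_by_attr):
    """Publish perturbed conditional statistics: each attribute's counts are
    multiplied by a factor shared by all classes, so the ratios across classes
    (and therefore every argmax decision) are unchanged."""
    return {(c, a, v): cnt * scale_by_attr.get(a, 1.0)
            for (c, a, v), cnt in cond_counts.items()}
```

Because the per-attribute factor multiplies every class's score by the same constant, `classify` returns identical labels on the original and the published statistics, which is the "privacy without accuracy loss" intuition in miniature; the paper's actual contribution is choosing such perturbations so that the published ratios also satisfy a formal privacy condition.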