Publishing naive Bayesian classifiers: privacy without accuracy loss

  • Authors:
  • Barzan Mozafari; Carlo Zaniolo

  • Affiliations:
  • University of California, Los Angeles; University of California, Los Angeles

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2009

Abstract

We address the problem of publishing a Naïve Bayesian Classifier (NBC) or, equivalently, publishing the necessary views for building an NBC, while protecting privacy of the individuals who provided the training data. Our approach completely preserves the accuracy of the original classifier, and thus significantly improves on current approaches, such as randomization or anonymization, which typically degrade accuracy to preserve privacy. Current query-view security checkers address the question of 'Is the view safe to publish?' and are computationally expensive (often Π^p_2-complete). Here instead, we tackle the question of 'How to make a view safe to publish?' and propose a linear-time algorithm to publish safe NBC-enabling views. We first show that a simple measure that restricts the ratios between the published NBC statistics is sufficient to prevent any breach of privacy. Then, we propose a linear-time algorithm to enforce this measure by producing perturbed statistics that assure both (i) individuals' privacy, and (ii) a classifier that behaves in the same way as the NBC trained on the original data. By carefully expressing the derived statistics using rational numbers, we can easily produce synthetic (sanitized) datasets. Thus, for any given dataset, we produce another dataset that is secure to publish (w.r.t. a uniform prior) and achieves the same classification accuracy. Finally, we extend our results by providing sufficient conditions to cope with arbitrary (non-uniform prior) distributions, and we validate their effectiveness in practice through experiments on real-world data.
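The sketch below is only an illustration of the underlying observation, not the paper's algorithm: an NBC's predictions depend on the argmax over products of per-class statistics, so rescaling each attribute's conditional counts by a factor that is shared across all classes leaves every classification decision unchanged. All function names and data layouts here are hypothetical.

```python
# Minimal, hypothetical sketch (not the paper's actual algorithm): perturbing
# NBC statistics with a per-attribute scale factor shared across classes
# preserves the argmax, and hence the classifier's behavior.
from collections import defaultdict

def train_nbc(records, label_key):
    """Collect the raw counts an NBC needs: class counts and per-class,
    per-attribute-value counts."""
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)   # (class, attribute, value) -> count
    for rec in records:
        c = rec[label_key]
        class_counts[c] += 1
        for attr, val in rec.items():
            if attr != label_key:
                cond_counts[(c, attr, val)] += 1
    return dict(class_counts), dict(cond_counts)

def classify(class_counts, cond_counts, instance):
    """Pick the class maximizing P(c) * prod_i P(a_i | c), estimated from the
    published statistics."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total
        for attr, val in instance.items():
            score *= cond_counts.get((c, attr, val), 0) / n_c
        if score > best_score:
            best, best_score = c, score
    return best

def publish_scaled(cond_counts, scale_by_attr):
    """Publish perturbed conditional statistics: each attribute's counts are
    multiplied by a factor shared by all classes, so the ratios across classes
    (and therefore every argmax decision) are unchanged."""
    return {(c, a, v): cnt * scale_by_attr.get(a, 1.0)
            for (c, a, v), cnt in cond_counts.items()}
```

Because the per-attribute factor multiplies every class's score by the same constant, `classify` returns identical labels on the original and the published statistics, which is the "privacy without accuracy loss" intuition in miniature; the paper's actual contribution is choosing such perturbations so that the published ratios also satisfy a formal privacy condition.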