Privacy-preserving data mining through knowledge model sharing

  • Authors:
  • Patrick Sharkey, Hongwei Tian, Weining Zhang, Shouhuai Xu

  • Affiliations:
  • Department of Computer Science, University of Texas at San Antonio (all authors)

  • Venue:
  • PinKDD'07: Proceedings of the 1st ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD
  • Year:
  • 2007

Abstract

Privacy-preserving data mining (PPDM) is an important topic for both industry and academia. In general, there are two approaches to PPDM: one is statistics-based and the other is crypto-based. The statistics-based approach has the advantage of being efficient enough to handle large volumes of data. The basic idea underlying this approach is to let the data owners publish sanitized versions of their data (e.g., via perturbation, generalization, or l-diversification), which are then used for extracting useful knowledge models such as decision trees. In this paper, we present a new method for statistics-based PPDM. Our method differs from existing ones in that it lets the data owners share with each other the knowledge models extracted from their own private datasets, rather than having the data owners publish any of their private datasets (not even in sanitized form). The knowledge models derived from the individual datasets are used to generate pseudo-data, which are then used for extracting the desired "global" knowledge models. While this approach is instrumental, there are technical subtleties that need to be carefully addressed. Specifically, we propose an algorithm for generating pseudo-data according to the paths of a decision tree, a method for adapting anonymity measures of datasets to measure the privacy of decision trees, and an algorithm that prunes a decision tree to satisfy a given anonymity requirement. Through an empirical study, we show that predictive models learned using our method are significantly more accurate than those learned using the existing l-diversity method, in both centralized and distributed environments, across different types of datasets, predictive models, and utility measures.
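
The abstract's central mechanism is generating pseudo-data from the paths of a shared decision tree. The sketch below illustrates that idea under simplifying assumptions: each root-to-leaf path is modeled as a set of attribute constraints plus a class label and a support count, and synthetic records are sampled to satisfy those constraints. All names (`Path`, `generate_pseudo_data`, the toy attribute domains) are hypothetical illustrations, not the authors' actual algorithm.

```python
import random
from dataclasses import dataclass


@dataclass
class Path:
    """One root-to-leaf path: attribute constraints, predicted class label,
    and the number of training records that reached the leaf."""
    constraints: dict   # attribute -> set of allowed values on this path
    label: str          # class predicted at the leaf
    support: int        # records covered by this path


def generate_pseudo_data(paths, domains, records_per_support=1):
    """Sample pseudo-records path by path.

    domains: attribute -> full list of possible values; attributes not
    constrained on a path are drawn uniformly from their whole domain.
    """
    pseudo = []
    for path in paths:
        for _ in range(path.support * records_per_support):
            record = {}
            for attr, values in domains.items():
                allowed = path.constraints.get(attr, values)
                record[attr] = random.choice(list(allowed))
            record["class"] = path.label
            pseudo.append(record)
    return pseudo


# Toy usage: two paths over two categorical attributes.
domains = {"age": ["young", "middle", "old"], "income": ["low", "high"]}
paths = [
    Path(constraints={"age": {"young"}}, label="no", support=3),
    Path(constraints={"age": {"old"}, "income": {"high"}}, label="yes", support=2),
]
for row in generate_pseudo_data(paths, domains):
    print(row)
```

The pseudo-data generated this way can be pooled across data owners and fed to an ordinary learner to build the "global" model; the paper's additional contributions (measuring a tree's anonymity and pruning it to meet an anonymity requirement) operate on the tree before such generation and are not shown here.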