A Knowledge Model Sharing Based Approach to Privacy-Preserving Data Mining

  • Authors:
  • Hongwei Tian; Weining Zhang; Shouhuai Xu; Patrick Sharkey

  • Affiliations:
  • Department of Computer Science, University of Texas at San Antonio. e-mail: htian@cs.utsa.edu; Department of Computer Science, University of Texas at San Antonio. e-mail: wzhang@cs.utsa.edu; Department of Computer Science, University of Texas at San Antonio. e-mail: shxu@cs.utsa.edu; Department of Computer Science, University of Texas at San Antonio. e-mail: psharkey@cs.utsa.edu

  • Venue:
  • Transactions on Data Privacy
  • Year:
  • 2012

Abstract

Privacy-preserving data mining (PPDM) is an important problem that is currently studied under three approaches: the cryptographic approach, data publishing, and model publishing. However, each of these approaches has limitations. The cryptographic approach does not protect the privacy of learned knowledge models and may have performance and scalability issues. Data publishing, although popular, may suffer from too much utility loss for certain types of data mining applications. Model publishing lacks efficient algorithms for practical use in a multiple data source environment. In this paper, we present a knowledge model sharing based approach, which learns a global knowledge model from pseudo-data generated according to anonymized knowledge models published by local data sources. Specifically, for the anonymization of knowledge models, we present two privacy measures for decision trees and an algorithm that obtains an anonymized decision tree by tree pruning. For pseudo-data generation, we present an algorithm that generates useful pseudo-data from decision trees. We empirically study our method by comparing it with several PPDM methods that use existing techniques, including three methods that publish anonymized data, one method that learns anonymized decision trees directly from the original data, and one method that uses ensemble classification. Our results show that, in both single data source and multiple data source environments and for several different datasets, predictive models, and utility measures, our method obtains significantly better predictive models (especially decision trees) than the other methods.
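
The pipeline outlined in the abstract (prune a local decision tree, publish it, then generate pseudo-data from it) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not define the two privacy measures or the generation procedure, so the sketch assumes a simple stand-in rule (undo any split that has a child supported by fewer than k records) and assumes that attributes not fixed by a tree path are drawn uniformly at random. The `Node` class, the `prune` and `generate` functions, and the toy weather-style tree are all hypothetical names and data introduced only for illustration.

```python
import random
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical tree node; not the paper's data structure."""
    counts: Counter                      # class label -> records reaching this node
    attr: Optional[str] = None           # splitting attribute, None for a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)  # value -> subtree


def size(node):
    """Number of training records that reached this node."""
    return sum(node.counts.values())


def prune(node, k):
    """Assumed stand-in privacy rule: working bottom-up, undo any split
    that has a child supported by fewer than k records."""
    if node.attr is None:
        return node
    for child in node.children.values():
        prune(child, k)
    if any(size(child) < k for child in node.children.values()):
        node.attr, node.children = None, {}  # collapse the split into a leaf
    return node


def generate(node, domains, rng, path=None):
    """Emit (record, label) pairs that match each leaf's class counts.
    Attributes fixed by the path keep their value; the rest are drawn
    uniformly from their domain (an assumption of this sketch)."""
    path = path or {}
    if node.attr is not None:
        rows = []
        for value, child in node.children.items():
            rows += generate(child, domains, rng, {**path, node.attr: value})
        return rows
    rows = []
    for label, n in node.counts.items():
        for _ in range(n):
            row = {a: (path[a] if a in path else rng.choice(vals))
                   for a, vals in domains.items()}
            rows.append((row, label))
    return rows


# Toy local tree (made-up counts, for illustration only).
domains = {"outlook": ["sunny", "rainy"], "windy": ["true", "false"]}
root = Node(Counter(yes=6, no=4), "outlook", {
    "sunny": Node(Counter(yes=1, no=3), "windy", {
        "true": Node(Counter(no=3)),
        "false": Node(Counter(yes=1)),   # only one record: too specific for k=2
    }),
    "rainy": Node(Counter(yes=5, no=1)),
})

published = prune(root, k=2)                        # what a local source would share
pseudo = generate(published, domains, random.Random(0))
```

In a multiple data source setting, the pseudo-data lists generated from each source's published tree could be concatenated and passed to any standard learner to obtain the global model; the choice of learner is left open here.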