Privacy-preserving data mining: A feature set partitioning approach

  • Authors:
  • Nissim Matatov;Lior Rokach;Oded Maimon

  • Affiliations:
  • Department of Industrial Engineering, Tel-Aviv University, Israel;Department of Information System Engineering, Ben Gurion University of the Negev, Be'er Sheva 84105, Israel and Duetsche Telekom Laboratories at Ben Gurion University of the Negev, Be'er Sheva 841 ...;Department of Industrial Engineering, Tel-Aviv University, Israel

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2010

Quantified Score

Hi-index 0.07

Visualization

Abstract

In privacy-preserving data mining (PPDM), a widely used method for achieving data mining goals while preserving privacy is based on k-anonymity. This method, which protects subject-specific sensitive data by anonymizing it before it is released for data mining, demands that every tuple in the released table should be indistinguishable from no fewer than k subjects. The most common approach for achieving compliance with k-anonymity is to replace certain values with less specific but semantically consistent values. In this paper we propose a different approach for achieving k-anonymity by partitioning the original dataset into several projections such that each one of them adheres to k-anonymity. Moreover, any attempt to rejoin the projections, results in a table that still complies with k-anonymity. A classifier is trained on each projection and subsequently, an unlabelled instance is classified by combining the classifications of all classifiers. Guided by classification accuracy and k-anonymity constraints, the proposed data mining privacy by decomposition (DMPD) algorithm uses a genetic algorithm to search for optimal feature set partitioning. Ten separate datasets were evaluated with DMPD in order to compare its classification performance with other k-anonymity-based methods. The results suggest that DMPD performs better than existing k-anonymity-based algorithms and there is no necessity for applying domain dependent knowledge. Using multiobjective optimization methods, we also examine the tradeoff between the two conflicting objectives in PPDM: privacy and predictive performance.