A framework for condensation-based anonymization of string data

  • Authors:
  • Charu C. Aggarwal;Philip S. Yu

  • Affiliations:
  • IBM T. J. Watson Research Center, Hawthorne, USA;University of Illinois at Chicago, Chicago, USA

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

In recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. An important method for privacy preserving data mining is the method of condensation. This method is often used in the case of multi-dimensional data in which pseudo-data is generated to mask the true values of the records. However, these methods are not easily applicable to the case of string data, since they require the use of multi-dimensional statistics in order to generate the pseudo-data. String data are especially important in the privacy preserving data-mining domain because most DNA and biological data are coded as strings. In this article, we will discuss a new method for privacy preserving mining of string data with the use of simple template-based models. The template-based model turns out to be effective in practice, and preserves important statistical characteristics of the strings such as intra-record distances. We will explore the behavior in the context of a classification application, and show that the accuracy of the application is not affected significantly by the anonymization process.