Privacy-preserving sharing of horizontally-distributed private data for constructing accurate classifiers

Authors:
Vincent Yan Fu Tan;See-Kiong Ng
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA;Institute for Infocomm Research, Singapore
Venue:
PinKDD'07 Proceedings of the 1st ACM SIGKDD international conference on Privacy, security, and trust in KDD
Year:
2007


Abstract

Data mining tasks such as supervised classification often benefit from a large training dataset. In many application domains, however, privacy concerns hinder the construction of an accurate classifier from datasets combined across multiple sites. In this work, we propose a novel privacy-preserving distributed data sanitization algorithm that randomizes the private data at each site independently before the data are pooled at a centralized site to train a classifier. Other researchers have proposed distance-preserving perturbation approaches, but we show that these can be susceptible to security risks. To enhance security, we instead use a non-distance-preserving approach: Kernel Density Estimation (KDE) Resampling, in which samples are drawn independently from a distribution that is approximately equal to the original data's distribution. KDE Resampling provides consistent density estimates, and its randomized samples are asymptotically independent of the original samples. Together these properties give high accuracy with low privacy loss, especially when a large number of samples is available. We evaluated our approach on five standard datasets in a distributed setting using three different classifiers. In the worst case, the classification error deteriorated by only 3% when the randomized data were used instead of the original private data. With a large number of samples, KDE Resampling effectively preserves privacy (due to the asymptotic independence property) while maintaining the data integrity needed to construct accurate classifiers (due to consistency).
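As a rough illustration of the resampling idea described in the abstract, the sketch below draws new points from a Gaussian kernel density estimate fitted to one site's data. It is a minimal sketch under stated assumptions, not the authors' algorithm: the Gaussian kernel, the Silverman-style per-dimension bandwidth, the per-class resampling, and the function names `kde_resample` and `sanitize_site` are all choices made here for illustration and do not come from the abstract.

```python
import numpy as np


def kde_resample(X, n_samples, bandwidth=None, rng=None):
    """Draw n_samples points from a Gaussian KDE fitted to the rows of X.

    Assumes X has at least two rows so a sample standard deviation exists.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if bandwidth is None:
        # Silverman-style rule of thumb per dimension (an assumed choice,
        # not taken from the abstract).
        sigma = X.std(axis=0, ddof=1)
        bandwidth = sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    # Sampling from a Gaussian KDE: pick a kernel centre uniformly at
    # random, then add Gaussian noise scaled by the bandwidth.
    centres = X[rng.integers(0, n, size=n_samples)]
    noise = rng.normal(size=(n_samples, d)) * bandwidth
    return centres + noise


def sanitize_site(X, y, rng=None):
    """Hypothetical per-site step: resample each class separately so the
    released points keep their class labels."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    parts_X, parts_y = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        parts_X.append(kde_resample(Xc, len(Xc), rng=rng))
        parts_y.append(np.full(len(Xc), label))
    return np.vstack(parts_X), np.concatenate(parts_y)
```

In this reading, each site would call `sanitize_site` on its own records and send only the returned arrays to the central site, which pools them and trains a classifier as usual. Because the noise is added independently per point, pairwise distances between records are not preserved, which matches the non-distance-preserving property the abstract emphasizes.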