A privacy-sensitive approach to distributed clustering

Authors:
Srujana Merugu;Joydeep Ghosh
Affiliations:
Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA;Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA
Venue:
Pattern Recognition Letters - Special issue: Advances in pattern recognition
Year:
2005

Abstract

Data mining algorithms are often designed to operate on centralized data, but in practice data is frequently acquired and stored in a distributed manner. Centralizing such data before analysis may be undesirable, and is often impossible, because of real-life constraints such as security, privacy, and communication costs. This paper presents a general framework for distributed clustering that takes privacy requirements into account. The framework builds a probabilistic model of the data at each local site and transmits the parameters of these models to a central location. We show mathematically that the best representative of all the local models is a certain "mean" model, and show empirically that this model can be approximated quite well by generating artificial samples from the local models using sampling techniques and then fitting a global model of a chosen parametric form to these samples. We also propose a new measure that quantifies privacy based on information-theoretic concepts, and show that decreasing privacy improves the quality of the global model and vice versa. Empirical results on different kinds of data highlight the generality of the framework and show that high-quality global clusters can be achieved with little loss of privacy.
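The pipeline described in the abstract (local models, transmitted parameters, artificial samples, global fit) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the data are one-dimensional, each site summarizes its data with a single Gaussian, the "mean" model is approximated by sampling from the count-weighted mixture of the local models, and the global model is fitted with plain k-means. All function and variable names are hypothetical.

```python
import random
import statistics


def fit_local_model(points):
    """Summarize one site's private data as Gaussian parameters (mean, std, count)."""
    return (statistics.fmean(points), statistics.pstdev(points), len(points))


def sample_from_local_models(models, n_samples, rng):
    """Draw artificial samples from the count-weighted mixture of the local models."""
    weights = [count for _, _, count in models]
    samples = []
    for _ in range(n_samples):
        # Pick a site in proportion to its data size, then sample from its model.
        mean, std, _ = rng.choices(models, weights=weights, k=1)[0]
        samples.append(rng.gauss(mean, std))
    return samples


def fit_global_model(samples, k, n_iter=50):
    """Fit k cluster centers to the artificial samples with 1-D k-means."""
    ordered = sorted(samples)
    # Deterministic initialization at evenly spaced quantiles of the samples.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [statistics.fmean(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


rng = random.Random(0)

# Three hypothetical sites; the raw points never leave their site.
site_data = [
    [rng.gauss(0.0, 1.0) for _ in range(200)],
    [rng.gauss(10.0, 1.0) for _ in range(200)],
    [rng.gauss(10.0, 1.0) for _ in range(200)],
]

# Each site transmits three numbers to the central location.
local_models = [fit_local_model(points) for points in site_data]

# The central location works with model parameters alone.
artificial = sample_from_local_models(local_models, n_samples=1000, rng=rng)
centers = fit_global_model(artificial, k=2)
print(centers)
```

In this sketch the central location never sees a raw data point, which is the sense in which the scheme is privacy-sensitive. A coarser local model (here, a single Gaussian per site) reveals less about individual points but also gives the global fit less to work with, which mirrors the privacy and quality trade-off the abstract describes.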