A privacy-sensitive approach to distributed clustering

  • Authors:
  • Srujana Merugu; Joydeep Ghosh

  • Affiliations:
  • Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA (both authors)

  • Venue:
  • Pattern Recognition Letters - Special issue: Advances in pattern recognition
  • Year:
  • 2005

Abstract

While data mining algorithms are often designed to operate on centralized data, in practice data is frequently acquired and stored in a distributed manner. Centralizing such data before analysis may not be desirable, and is often not possible, due to real-life constraints such as security, privacy and communication costs. This paper presents a general framework for distributed clustering that takes privacy requirements into account. It is based on building probabilistic models of the data at each local site, whose parameters are then transmitted to a central location. We show mathematically that the best representative of all the local models is a certain "mean" model, and show empirically that this model can be approximated quite well by generating artificial samples from the local models using sampling techniques and then fitting a global model of a chosen parametric form to these samples. We also propose a new measure that quantifies privacy based on information-theoretic concepts, and show that decreasing privacy improves the quality of the global model, and vice versa. Empirical results are provided on several kinds of data to highlight the generality of our framework. They show that high-quality global clusters can be obtained with little loss of privacy.
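The pipeline described in the abstract (fit local probabilistic models, share only their parameters, sample artificial data from them centrally, then fit a single global model) can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: it assumes Gaussian mixture models at each site, uses scikit-learn's GaussianMixture, and the site data, component counts, and sample sizes are all hypothetical choices for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)

# Hypothetical setup: three sites, each holding its own private 2-D data.
local_data = [rng.normal(loc=mu, scale=1.0, size=(200, 2)) for mu in (0.0, 5.0, 10.0)]

# Step 1: each site fits a probabilistic model (here a GMM) to its own data.
# Only the fitted parameters would cross the network, never the raw records.
local_models = [GaussianMixture(n_components=2, random_state=0).fit(X) for X in local_data]

# Step 2: the central site draws artificial samples from every local model.
# The number of samples per site is one knob on the privacy/quality trade-off:
# more samples approximate the "mean" model better but expose more detail.
n_samples_per_site = 500
artificial = np.vstack([gmm.sample(n_samples_per_site)[0] for gmm in local_models])

# Step 3: fit a global model of a chosen parametric form to the pooled samples.
global_model = GaussianMixture(n_components=6, random_state=0).fit(artificial)

print("Global cluster means:\n", global_model.means_)
```

In this sketch the central site never touches the original observations; it clusters only synthetic data generated from the transmitted model parameters, which is the core of the privacy-sensitive scheme the paper formalizes.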