Privacy-preserving Distributed Clustering using Generative Models

  • Authors:
  • Srujana Merugu; Joydeep Ghosh

  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003

Abstract

We present a framework for clustering distributed data in unsupervised and semi-supervised scenarios, taking into account privacy requirements and communication costs. Rather than sharing parts of the original or perturbed data, we instead transmit the parameters of suitable generative models built at each local data site to a central location. We mathematically show that the best representative of all the data is a certain "mean" model, and empirically show that this model can be approximated quite well by generating artificial samples from the underlying distributions using Markov Chain Monte Carlo techniques, and then fitting a combined global model with a chosen parametric form to these samples. We also propose a new measure that quantifies privacy based on information theoretic concepts, and show that decreasing privacy leads to a higher quality of the combined model and vice versa. We provide empirical results on different data types to highlight the generality of our framework. The results show that high quality distributed clustering can be achieved with little privacy loss and low communication cost.
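
The pipeline the abstract describes (fit a generative model at each site, send only its parameters, sample artificial data centrally, fit a global model) can be sketched roughly as follows. This is a minimal illustration under several assumptions that are not taken from the paper: each site fits a single 1-D Gaussian, sites are weighted by their data counts, plain ancestral sampling stands in for the Markov Chain Monte Carlo step, and the "chosen parametric form" is a Gaussian mixture fitted with EM. All function names are hypothetical.

```python
import math
import random


def fit_local_gaussian(points):
    """Fit a 1-D Gaussian at one site; only (mean, variance, count) leaves the site."""
    n = len(points)
    mean = sum(points) / n
    var = sum((x - mean) ** 2 for x in points) / n
    return mean, max(var, 1e-6), n


def sample_from_local_models(models, n_samples, rng):
    """Draw artificial samples from the mixture of local models.

    Assumption: sites are weighted by their data counts, and simple
    ancestral sampling is used instead of MCMC.
    """
    weights = [n for _, _, n in models]
    samples = []
    for _ in range(n_samples):
        mean, var, _ = rng.choices(models, weights=weights, k=1)[0]
        samples.append(rng.gauss(mean, math.sqrt(var)))
    return samples


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_global_mixture(samples, k, n_iter):
    """Fit a k-component 1-D Gaussian mixture to the artificial samples with EM."""
    n = len(samples)
    ordered = sorted(samples)
    # Initialise component means at evenly spaced quantiles of the samples.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(samples) / n
    overall_var = sum((x - overall_mean) ** 2 for x in samples) / n
    variances = [max(overall_var, 1e-6)] * k
    priors = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [priors[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each component from its weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var_j, 1e-6)
            priors[j] = nj / n
    # Return (mean, variance, prior) triples ordered by mean.
    return sorted(zip(means, variances, priors))


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical sites, each holding data from a different cluster.
    site_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
    site_b = [rng.gauss(10.0, 1.0) for _ in range(200)]
    local_models = [fit_local_gaussian(site_a), fit_local_gaussian(site_b)]
    artificial = sample_from_local_models(local_models, 1000, rng)
    for mean, var, prior in fit_global_mixture(artificial, k=2, n_iter=30):
        print(f"mean={mean:.2f} var={var:.2f} prior={prior:.2f}")
```

In this sketch the central site sees only three numbers per site, which illustrates the communication and privacy argument in the abstract. The sketch does not implement the paper's privacy measure or its "mean" model derivation.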