A distributed learning framework for heterogeneous data sources

  • Authors:
  • Srujana Merugu; Joydeep Ghosh

  • Affiliations:
  • The University of Texas at Austin; The University of Texas at Austin

  • Venue:
  • Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05)
  • Year:
  • 2005


Abstract

We present a probabilistic model-based framework for distributed learning that takes into account privacy restrictions and is applicable to scenarios where the different sites have diverse, possibly overlapping subsets of features. Our framework decouples data privacy issues from knowledge integration issues by requiring the individual sites to share only privacy-safe probabilistic models of the local data, which are then integrated to obtain a global probabilistic model based on the union of the features available at all the sites. We provide a mathematical formulation of the model integration problem using the maximum likelihood and maximum entropy principles and describe iterative algorithms that are guaranteed to converge to the optimal solution. For certain commonly occurring special cases involving hierarchically ordered feature sets or conditional independence, we obtain closed form solutions and use these to propose an efficient alternative scheme by recursive decomposition of the model integration problem. To address interpretability concerns, we also present a modified formulation where the global model is assumed to belong to a specified parametric family. Finally, to highlight the generality of our framework, we provide empirical results for various learning tasks such as clustering and classification on different kinds of datasets consisting of continuous vector, categorical and directional attributes. The results show that high quality global models can be obtained without much loss of privacy.
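The abstract mentions closed-form solutions when the global model satisfies a conditional-independence structure over the shared features. As a rough illustration of that idea, and not the paper's actual algorithm, the sketch below combines two local Gaussian models over overlapping feature subsets {x0, x1} and {x1, x2} into a global Gaussian over {x0, x1, x2} using the standard maximum-entropy completion p(x0, x1, x2) = pA(x0, x1) pB(x1, x2) / p(x1), which makes x0 and x2 conditionally independent given the shared feature x1. All numbers, and the helper names gaussian_to_canonical and pad, are invented for this example; it assumes the two sites agree on the marginal of the shared feature.

```python
import numpy as np

def gaussian_to_canonical(mu, Sigma):
    """Moment form (mu, Sigma) -> canonical form (J, h),
    where J = inv(Sigma) and h = J @ mu."""
    J = np.linalg.inv(Sigma)
    return J, J @ mu

def pad(M, idx, dim):
    """Embed a local precision matrix or potential vector, defined over
    the feature indices `idx`, into the dim-dimensional global space."""
    if M.ndim == 2:
        out = np.zeros((dim, dim))
        out[np.ix_(idx, idx)] = M
    else:
        out = np.zeros(dim)
        out[idx] = M
    return out

# Local privacy-safe models: site A shares a Gaussian over (x0, x1),
# site B shares a Gaussian over (x1, x2). Their marginals over the
# shared feature x1 agree (mean 1.0, variance 1.0) by construction.
mu_a, Sigma_a = np.array([0.0, 1.0]),  np.array([[2.0, 0.5], [0.5, 1.0]])
mu_b, Sigma_b = np.array([1.0, -1.0]), np.array([[1.0, 0.3], [0.3, 1.5]])

# Shared-feature marginal p(x1), taken from site A's model.
mu_s, Sigma_s = mu_a[1:], Sigma_a[1:, 1:]

J_a, h_a = gaussian_to_canonical(mu_a, Sigma_a)
J_b, h_b = gaussian_to_canonical(mu_b, Sigma_b)
J_s, h_s = gaussian_to_canonical(mu_s, Sigma_s)

# Max-entropy combination in canonical form: add the padded local
# parameters and subtract the shared marginal's to avoid counting
# the x1 information twice.
dim = 3
J = pad(J_a, [0, 1], dim) + pad(J_b, [1, 2], dim) - pad(J_s, [1], dim)
h = pad(h_a, [0, 1], dim) + pad(h_b, [1, 2], dim) - pad(h_s, [1], dim)

Sigma = np.linalg.inv(J)   # global covariance over (x0, x1, x2)
mu = Sigma @ h             # global mean

print("global mean:", mu)
print("global covariance:\n", Sigma)
```

By construction the global model reproduces each site's marginal exactly, and among all joints with those marginals it is the one with maximum entropy, so it adds no dependence between x0 and x2 beyond what flows through x1. The paper's framework generalizes well beyond this toy case, handling iterative solutions when no closed form exists and model families other than Gaussians.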