CONQUEST: A Coarse-Grained Algorithm for Constructing Summaries of Distributed Discrete Datasets

  • Authors:
  • Jie Chi;Mehmet Koyuturk;Ananth Grama

  • Affiliations:
  • Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA;Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA;Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA

  • Venue:
  • Algorithmica
  • Year:
  • 2006

Abstract

In this paper we present a coarse-grained parallel algorithm, CONQUEST, for constructing bounded-error summaries of high-dimensional binary-attributed data in a distributed environment. Such summaries enable more expensive analysis techniques to be applied efficiently under constraints on computation, communication, and privacy, with little loss in accuracy. While the discrete and high-dimensional nature of the dataset makes the problem difficult even in its serial formulation, the loose coupling of the distributed servers hosting the data and the heterogeneity in network bandwidth present additional challenges. CONQUEST is based on a novel linear algebraic tool, PROXIMUS, which has been shown to be highly effective on a serial platform. In contrast to traditional fine-grained parallel techniques that distribute the kernel operations, CONQUEST adopts a coarse-grained parallel formulation that relies on the principle of sampling to reduce communication overhead while maintaining high accuracy. Specifically, each individual site computes its local patterns independently. Sites then cooperate in dynamically orchestrated work groups to construct consensus patterns from these local patterns. Individual sites may then decide to continue their participation in the consensus or leave the group. Such a parallel formulation implicitly resolves load-balancing and privacy issues while significantly reducing communication volume. Experimental results on an Intel Xeon cluster demonstrate that this strategy is capable of excellent performance in terms of compression time, compression ratio, and accuracy with respect to post-processing tasks.
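The two-level idea in the abstract (local patterns first, consensus patterns second) can be illustrated with a toy sketch. This is not the CONQUEST algorithm and does not use PROXIMUS: it assumes a simple greedy grouping by Hamming distance as a stand-in for pattern discovery, and it omits sampling, dynamic work groups, and sites leaving the consensus. The names `summarize`, `local_patterns`, `consensus_patterns`, and the `radius` parameter are all hypothetical.

```python
from typing import List, Tuple

Row = Tuple[int, ...]          # one binary-attributed record
Pattern = Tuple[Row, int]      # (representative vector, number of rows it covers)


def hamming(a: Row, b: Row) -> int:
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def summarize(rows: List[Row], weights: List[int], radius: int) -> List[Pattern]:
    """Greedy stand-in for pattern discovery (an assumption, not PROXIMUS):
    each row joins the first pattern within `radius`, else starts a new one."""
    patterns: List[Pattern] = []
    for row, weight in zip(rows, weights):
        for i, (rep, count) in enumerate(patterns):
            if hamming(rep, row) <= radius:
                patterns[i] = (rep, count + weight)
                break
        else:
            patterns.append((row, weight))
    return patterns


def local_patterns(site_rows: List[Row], radius: int) -> List[Pattern]:
    """Step 1: a site summarizes its own rows without any communication."""
    return summarize(site_rows, [1] * len(site_rows), radius)


def consensus_patterns(per_site: List[List[Pattern]], radius: int) -> List[Pattern]:
    """Step 2: only the local patterns and their counts are exchanged and merged."""
    reps = [rep for site in per_site for rep, _ in site]
    counts = [count for site in per_site for _, count in site]
    return summarize(reps, counts, radius)


# Two hypothetical sites, each holding three 4-attribute binary rows.
site_a = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1)]
site_b = [(1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1)]

locals_ = [local_patterns(site, radius=1) for site in (site_a, site_b)]
consensus = consensus_patterns(locals_, radius=1)
print(consensus)  # [((1, 1, 0, 0), 3), ((0, 0, 1, 1), 3)]
```

In this sketch the raw rows never leave a site, which loosely mirrors the communication and privacy argument of the abstract. By the triangle inequality, every row lies within `2 * radius` of its consensus pattern, so the toy summary is bounded-error in the Hamming sense. The paper's own error bound and construction may differ.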