Efficient collective communication in distributed heterogeneous systems

  • Authors: Prashanth B. Bhat; C. S. Raghavendra; Viktor K. Prasanna

  • Affiliations: Department of EE-Systems, EEB 246, University of Southern California, Los Angeles, CA; The Aerospace Corporation, Department of EE-Systems, University of Southern California, Los Angeles, CA; Department of EE-Systems, EEB 200C, University of Southern California, Los Angeles, CA

  • Venue: Journal of Parallel and Distributed Computing
  • Year: 2003

Quantified Score

Hi-index 0.01

Abstract

With recent advances in high-speed networks, distributed heterogeneous computing has emerged as an attractive computational paradigm. Wide-area grid infrastructures will enable distributed applications, such as video conferencing and distributed interactive simulation, to seamlessly integrate collections of heterogeneous workstations, multiprocessors, and mobile nodes. The underlying network is typically a collection of heterogeneous links built on different networking technologies. Such a heterogeneous network is also typical of local area workstation clusters, which are increasingly being used as alternatives to parallel computing systems. This paper introduces a framework for developing efficient collective communication schedules over such heterogeneous networks. We focus on application-level communication between the processes of a parallel program. Our framework consists of analytical models of the heterogeneous system, scheduling algorithms for the collective communication pattern, and performance evaluation mechanisms. We show that previous models, which considered node heterogeneity but ignored network heterogeneity, can lead to solutions that are worse than optimal by an unbounded factor. We then introduce an enhanced communication model and develop three heuristic algorithms for the broadcast and multicast patterns. The completion time of the schedule is chosen as the performance metric. The heuristic algorithms are fastest edge first (FEF), earliest completing edge first (ECEF), and ECEF with look-ahead. For small system sizes, we find the optimal solution using exhaustive search. Our simulation experiments indicate that the performance of our heuristic algorithms is close to optimal. For performance evaluation of larger systems, we have also developed a simple lower bound on the completion time. Our heuristic algorithms achieve significant performance improvements over previous approaches.
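
The abstract names the FEF and ECEF heuristics without defining them, so the following Python sketch is only one plausible reading of the two names, not the paper's own formulation. It assumes a pairwise cost matrix (cost[i][j] is the time for node i to send the message to node j), that a sender handles one transfer at a time, and that a node can forward the message only after receiving it in full. The names greedy_broadcast and EXAMPLE_COST are hypothetical, and the look-ahead variant is not sketched.

```python
from typing import Dict, List, Tuple

# One scheduled transfer: (sender, receiver, start_time, finish_time).
Transfer = Tuple[int, int, float, float]

# Hypothetical 4-node cost matrix, chosen so the two rules differ.
EXAMPLE_COST: List[List[float]] = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 10.0, 3.5],
    [2.0, 10.0, 0.0, 10.0],
    [3.0, 3.5, 10.0, 0.0],
]


def greedy_broadcast(
    cost: List[List[float]], source: int, rule: str
) -> Tuple[List[Transfer], float]:
    """Build a broadcast schedule greedily and return it with its completion time.

    rule == "FEF":  pick the cheapest edge from an informed to an uninformed node.
    rule == "ECEF": pick the edge whose transfer would finish earliest, taking
                    into account when the sender becomes free.
    Both rules are assumed interpretations of the heuristic names.
    """
    if rule not in ("FEF", "ECEF"):
        raise ValueError("rule must be 'FEF' or 'ECEF'")
    n = len(cost)
    # ready[i] = earliest time at which informed node i can start its next send.
    ready: Dict[int, float] = {source: 0.0}
    schedule: List[Transfer] = []
    while len(ready) < n:
        best = None  # (key, sender, receiver, finish)
        for i in ready:
            for j in range(n):
                if j in ready:
                    continue
                finish = ready[i] + cost[i][j]
                key = cost[i][j] if rule == "FEF" else finish
                if best is None or key < best[0]:
                    best = (key, i, j, finish)
        _, i, j, finish = best
        schedule.append((i, j, ready[i], finish))
        # The sender is busy until the transfer ends; the receiver can
        # start forwarding from that same moment.
        ready[i] = finish
        ready[j] = finish
    return schedule, max(ready.values())


for rule in ("FEF", "ECEF"):
    _, completion = greedy_broadcast(EXAMPLE_COST, source=0, rule=rule)
    print(rule, completion)
```

On this example matrix the FEF reading keeps reusing the source's fast links and completes at time 6.0, while the ECEF reading hands the last transfer to an idle node and completes at 4.5. This shows how accounting for sender availability can shorten the schedule under the stated assumptions; it says nothing about the results reported in the paper.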