Achieving Robustness and Minimizing Overhead in Parallel Algorithms Through Overlapped Communication/Computation

  • Authors:
  • Arun K. Somani; Allen M. Sansano

  • Affiliations:
  • Electrical and Computer Engineering, Iowa State University, Ames, IA 50011-3060, arun@iastate.edu; C-Cube Microsystems, 1778 McCarthy Blvd, Milpitas, CA 95035, asansano@c-cube.com

  • Venue:
  • The Journal of Supercomputing - Special issue on embedded fault-tolerance systems
  • Year:
  • 2000

Abstract

One of the major goals in the design of parallel processing machines and algorithms is to achieve robustness and to reduce the overhead introduced when a given problem is parallelized or when a fault occurs. A key contributor to this overhead is communication time, in particular when a node is faulty and another node is substituting for it. Many architectures try to reduce this overhead by minimizing the actual time a communication takes, that is, by lowering latency and raising bandwidth. Another approach is to hide communication by overlapping it with computation, assuming that computation is the dominant cost. This paper presents the mechanisms provided in the Proteus parallel computer and shows how it uses overlapped communication and computation to hide communication effectively, both with and without a fault present. These techniques are easily extended for use in compiler support of parallel programming. We also address the complexity (or rather simplicity) of achieving complete exchange on the Proteus machine.