Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations

  • Authors:
  • Adnan Agbaria; Roy Friedman

  • Venue:
  • HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
  • Year:
  • 1999

Abstract

This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
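
As a rough illustration of the checkpoint/restart idea mentioned in the abstract, the following is a minimal single-process sketch in Python. It is not Starfish's implementation and does not use MPI or group communication. The names `save_checkpoint`, `load_checkpoint`, and `run_sum`, the pickle file format, and the simulated crash are all assumptions made for this example. It only shows the general pattern: periodically persist state, and on restart resume from the last saved state instead of starting over.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Persist state atomically: write a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a reader never sees a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(path, n, crash_at=None):
    """Sum 0..n-1, checkpointing after every step (hypothetical workload).

    If crash_at is set, raise an error before processing that index to
    simulate a node failure.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum` once with `crash_at` set and then again without it on the same path resumes from the saved index rather than from zero. A real system for parallel programs would also need to coordinate checkpoints across processes, which this sketch does not attempt.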