Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery

  • Authors:
  • Juan Leon; Allan L. Fisher; Peter Steenkiste

  • Venue:
  • Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery
  • Year:
  • 1993

Abstract

Many scientific problems benefit from computations that are parallel at a coarse grain. Collections of loosely coupled, heterogeneous computers are increasingly being applied to these problems. While individual computers are designed to be relatively reliable, a collection of several autonomous machines necessarily has a greater rate of failure. As data networks improve and larger multicomputers come into use, rates of failure will increase. PVM (Parallel Virtual Machine) is a popular software framework that facilitates message-passing network programming. We present enhancements to PVM that mask fail-stop, single-node failures from the application. Fail-safe PVM uses checkpoint and rollback to recover from such failures. Both checkpoints and rollbacks are transparent to the application, provided the application does not depend on real-time events. Recovery occurs without waiting for repair of the failed computer. The system does not rely on shared stable storage and does not require modifications to the operating system. We describe the design and implementation of fail-safe PVM, present measurements of checkpoint costs, and briefly discuss shortcomings and potential avenues for improvement.
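
The checkpoint-and-rollback idea named in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the protocol used by fail-safe PVM: the names `Worker`, `Coordinator`, and `run` are hypothetical, no PVM calls are involved, message passing is omitted, and snapshots are simply held in the coordinator's memory, which glosses over the abstract's point that the real system avoids shared stable storage. It shows only that, after a single simulated fail-stop failure, rolling every task back to the last common checkpoint yields the same result as a failure-free run.

```python
import copy


class Worker:
    """Toy stand-in for one task (hypothetical; not a PVM API)."""

    def __init__(self, name):
        self.name = name
        self.state = {"step": 0, "total": 0}

    def compute(self):
        # Deterministic work: no dependence on real-time events.
        self.state["step"] += 1
        self.state["total"] += self.state["step"]


class Coordinator:
    """Keeps the most recent snapshot of all workers (assumption: in memory)."""

    def __init__(self, workers):
        self.workers = workers
        self.saved = None

    def checkpoint(self):
        # Snapshot all workers together so the saved states are consistent.
        self.saved = {w.name: copy.deepcopy(w.state) for w in self.workers}

    def rollback(self):
        # Restore every worker, survivors included, to the last snapshot.
        for w in self.workers:
            w.state = copy.deepcopy(self.saved[w.name])


def run(steps, checkpoint_every, fail_at=None):
    """Run two workers for `steps` steps, with one optional simulated failure."""
    workers = [Worker("a"), Worker("b")]
    coord = Coordinator(workers)
    coord.checkpoint()  # initial checkpoint at step 0
    failed_once = False
    while workers[0].state["step"] < steps:
        for w in workers:
            w.compute()
        step = workers[0].state["step"]
        if step == fail_at and not failed_once:
            failed_once = True
            workers[1].state = None  # simulated fail-stop: state is lost
            coord.rollback()         # recover without waiting for repair
            continue
        if step % checkpoint_every == 0:
            coord.checkpoint()
    return [w.state["total"] for w in workers]
```

Calling `run(10, 3, fail_at=5)` loses worker `b` at step 5, rolls both workers back to the checkpoint taken at step 3, and then finishes with the same totals as `run(10, 3)`. The recomputed steps 4 and 5 are the rollback cost; the checkpoint interval trades that cost against the overhead of taking snapshots.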