Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery

Authors:
Juan Leon; Allan L. Fisher; Peter Steenkiste
Venue:
Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery
Year:
1993

Citing 0
Cited 21

HeNCE: a heterogenous network computing environment
Scientific Programming

Load balancing and fault tolerance in workstation clusters migrating groups of communicating processes
ACM SIGOPS Operating Systems Review

A case for two-level distributed recovery schemes
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems

A Case for Two-Level Recovery Schemes
IEEE Transactions on Computers

Efficient and flexible fault tolerance and migration of scientific simulations using CUMULVS
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools

SFT: a consistent checkpointing algorithm with shorter freezing time
ACM SIGOPS Operating Systems Review

SCR algorithm: saving/restoring states of file systems
ACM SIGOPS Operating Systems Review

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations
The Journal of Supercomputing

Process Interconnection Structures in Dynamically Changing Topologies
HiPC '00 Proceedings of the 7th International Conference on High Performance Computing

Fault-Tolerant Parallel Applications Using Queues and Actions
ICPP '97 Proceedings of the international Conference on Parallel Processing

Transparent Orthogonal Checkpointing through User-Level Pagers
POS-9 Revised Papers from the 9th International Workshop on Persistent Object Systems

Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques
SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems

Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System
SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems

Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing

Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings

Parallel processing with windows NT networks
NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997

Algorithm-based fault tolerance applied to high performance computing
Journal of Parallel and Distributed Computing

Transparent parallel checkpointing and migration in clusters and ClusterGrids
International Journal of Computational Science and Engineering

Fault-tolerant dynamic job scheduling policy
ICA3PP'05 Proceedings of the 6th international conference on Algorithms and Architectures for Parallel Processing

Robust parallel job scheduling infrastructure for service-oriented grid computing systems
ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part IV

X10-FT: Transparent fault tolerance for APGAS language and runtime
Parallel Computing

Abstract

Many scientific problems benefit from computations that are parallel at a coarse grain. Collections of loosely coupled, heterogeneous computers are increasingly being applied to these problems. While individual computers are designed to be relatively reliable, a collection of several autonomous machines necessarily has a greater rate of failure. As data networks improve and larger multicomputers come into use, rates of failure will increase. PVM (Parallel Virtual Machine) is a popular software framework that facilitates message-passing network programming. We present enhancements to PVM that mask fail-stop, single-node failures from the application. Fail-safe PVM uses checkpoint and rollback to recover from such failures. Both checkpoints and rollbacks are transparent to the application, provided the application does not depend on real-time events. Recovery occurs without waiting for repair of the failed computer. The system does not rely on shared stable storage and does not require modifications to the operating system. We describe the design and implementation of fail-safe PVM, present measurements of checkpoint costs, and briefly discuss shortcomings and potential avenues for improvement.
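
The claim that a collection of machines fails more often than any single machine can be illustrated with a short calculation. The sketch below is not from the paper: it assumes that machines fail independently and that each has the same per-period failure probability, and the function name `prob_any_failure` and the sample value 0.01 are hypothetical choices for illustration only.

```python
def prob_any_failure(p_single: float, n_machines: int) -> float:
    """Probability that at least one of n machines fails in a given period.

    Assumption (not from the paper): failures are independent and every
    machine has the same per-period failure probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability in [0, 1]")
    if n_machines < 0:
        raise ValueError("n_machines must be non-negative")
    # All n machines survive with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_single) ** n_machines


if __name__ == "__main__":
    # Hypothetical per-machine failure probability, chosen only for illustration.
    for n in (1, 8, 64):
        print(n, prob_any_failure(0.01, n))
```

Under these assumptions the probability of seeing at least one failure grows with every machine added, which is the motivation for masking single-node failures rather than restarting the whole computation.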