Many scientific problems benefit from computations that are parallel at a coarse grain. Collections of loosely coupled, heterogeneous computers are increasingly being applied to these problems. While individual computers are designed to be relatively reliable, a collection of several autonomous machines necessarily has a greater rate of failure. As data networks improve and larger multicomputers come into use, rates of failure will increase. PVM (Parallel Virtual Machine) is a popular software framework that facilitates message-passing network programming. We present enhancements to PVM that mask fail-stop, single-node failures from the application. Fail-safe PVM uses checkpoint and rollback to recover from such failures. Both checkpoints and rollbacks are transparent to the application, provided the application does not depend on real-time events. Recovery occurs without waiting for the failed computer to be repaired. The system does not rely on shared stable storage and does not require modifications to the operating system. We describe the design and implementation of fail-safe PVM, present measurements of checkpoint costs, and briefly discuss shortcomings and potential avenues for improvement.
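The claim that a collection of machines fails more often than any single machine can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes that node failures are independent and exponentially distributed, and that the computation is disrupted as soon as any one node fails. The function names and the example figures are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, n_nodes: int) -> float:
    """Mean time between failures of the whole collection.

    Assumption (not from the paper): failures are independent and
    exponentially distributed, so the per-node failure rates add up.
    """
    if n_nodes < 1 or node_mtbf_hours <= 0:
        raise ValueError("need at least one node and a positive MTBF")
    return node_mtbf_hours / n_nodes


def prob_failure_during(run_hours: float, node_mtbf_hours: float, n_nodes: int) -> float:
    """Probability that at least one node fails during a run of run_hours."""
    combined_rate = n_nodes / node_mtbf_hours  # failures per hour for the collection
    return 1.0 - math.exp(-combined_rate * run_hours)


if __name__ == "__main__":
    # Hypothetical figures: each node fails on average once per 1000 hours.
    for n in (1, 10, 100):
        print(n, system_mtbf(1000.0, n), round(prob_failure_during(24.0, 1000.0, n), 3))
```

Under these assumptions the mean time between failures shrinks in proportion to the number of machines, which is why a long-running computation on a large collection is likely to need some form of recovery such as checkpoint and rollback.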