Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Trade-offs in implementing causal message logging protocols
PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
IEEE Transactions on Parallel and Distributed Systems
The Hector Distributed Run-Time Environment
IEEE Transactions on Parallel and Distributed Systems
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
The Cost of Recovery in Message Logging Protocols
IEEE Transactions on Knowledge and Data Engineering
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Application Recovery in Parallel Programming Environment
Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Egida: An Extensible Toolkit For Low-Overhead Fault-Tolerance
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
MPICH-G2: a Grid-enabled implementation of the Message Passing Interface
Journal of Parallel and Distributed Computing - Special issue on computational grids
Integrating fault-tolerance techniques in grid applications
Integrating fault-tolerance techniques in grid applications
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
SHIELD: a fault-tolerant MPI for an infiniband cluster
HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
Hi-index | 0.00 |
This paper presents an implementation of several consistent recovery protocols at the abstract device level and their performance comparison. We have performed experiments using three NAS Parallel Benchmark applications with class C datasets on state of the art equipment. The interesting result is that causal message logging protocol has the most expensive recovery cost with communication intensive applications since it suffers from concentrated overload of simultaneous message replaying. Receiver-based optimistic message logging has the least recovery cost with drawback of extensive disk access overhead in failure-free executions. Coordinated checkpointing seems the most practical choice among them.