Space reclamation for uncoordinated checkpointing in message-passing systems
Space reclamation for uncoordinated checkpointing in message-passing systems
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
CLIP: a checkpointing tool for message-passing parallel programs
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
The MOSIX Distributed Operating System: Load Balancing for UNIX
The MOSIX Distributed Operating System: Load Balancing for UNIX
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
An Efficient and Transparent Thread Migration Scheme in the PM2 Runtime System
Proceedings of the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing
Automated application-level checkpointing of MPI programs
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Libckpt: Transparent Checkpointing under Unix
Libckpt: Transparent Checkpointing under Unix
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Performance evaluation of adaptive MPI
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Achieving high performance on extremely large parallel machines: performance prediction and load balancing
Interconnect agnostic checkpoint/restart in open MPI
Proceedings of the 18th ACM international symposium on High performance distributed computing
Hi-index | 0.01 |
As the size of high performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are effective approaches at dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.