This paper presents a checkpointing scheme implemented in a parallel library that runs on top of CHIMP/MPI. The main goals of the checkpointing mechanism are portability and efficiency: it is machine-independent and runs on every platform supported by MPI. The scheme allows checkpoints to be migrated and offers a flexible recovery mechanism based on data reconfiguration. The paper closes with performance results and with techniques that can be used to make the checkpointing mechanism more efficient.
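The two ideas named in the abstract, machine-independent checkpoints and recovery by data reconfiguration, can be illustrated with a small sketch. This is not the paper's library and it makes no MPI calls. It assumes that the application state is a list of floating-point values split into one block per process, that a fixed big-endian encoding is enough to make the checkpoint portable, and that reconfiguration means redistributing the data evenly over a different number of processes. The function names `save_checkpoint`, `load_checkpoint`, and `reconfigure` are hypothetical.

```python
import struct


def save_checkpoint(blocks):
    """Encode per-process blocks of floats in a fixed big-endian layout.

    Assumption: a byte order chosen once, independent of the host,
    stands in for the machine-independent format the abstract mentions.
    """
    out = [struct.pack(">I", len(blocks))]          # number of blocks
    for block in blocks:
        out.append(struct.pack(">I", len(block)))   # block length
        out.append(struct.pack(f">{len(block)}d", *block))
    return b"".join(out)


def load_checkpoint(data):
    """Decode a checkpoint produced by save_checkpoint."""
    offset = 0
    (nblocks,) = struct.unpack_from(">I", data, offset)
    offset += 4
    blocks = []
    for _ in range(nblocks):
        (n,) = struct.unpack_from(">I", data, offset)
        offset += 4
        blocks.append(list(struct.unpack_from(f">{n}d", data, offset)))
        offset += 8 * n                              # 8 bytes per double
    return blocks


def reconfigure(blocks, new_nprocs):
    """Redistribute the saved data evenly over new_nprocs processes.

    Assumption: a simple contiguous block distribution; the first
    (total % new_nprocs) ranks receive one extra element.
    """
    flat = [x for block in blocks for x in block]
    base, extra = divmod(len(flat), new_nprocs)
    result, start = [], 0
    for rank in range(new_nprocs):
        size = base + (1 if rank < extra else 0)
        result.append(flat[start:start + size])
        start += size
    return result


if __name__ == "__main__":
    # State saved by three processes, then recovered on two.
    state = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
    checkpoint = save_checkpoint(state)
    print(reconfigure(load_checkpoint(checkpoint), 2))
```

Because the checkpoint bytes do not depend on the host's byte order or on the number of processes that wrote them, the same file could in principle be moved to another machine and restarted with a different process count, which is the kind of flexibility the abstract describes.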