CPPC-G: fault-tolerant applications on the grid
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Performance evaluation of an application-level checkpointing solution on grids
Future Generation Computer Systems
enhancing fault-tolerance of large-scale MPI scientific applications
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience
International Journal of Parallel Programming
Hi-index | 0.00 |
This paper presents CPPC (Controller/Precompiler for Portable Checkpointing), a checkpointing tool designed for heterogeneous clusters and Grid infrastructures through the use of portable protocols, portable checkpoint files and portable code. It works at variable level being user-directed, thus generating small checkpoint files. It allows parallel processes to checkpoint independently, without runtime coordination or message-logging. Consistency is achieved at restart time by negotiating the restart point. A directive-based checkpointing precompiler has also been implemented to ease up user's effort. CPPC was designed to work with parallel MPI programs, though it can be used with sequential ones, and easily extended to parallel programs written using different message-passing libraries, due to its highly modular design. Experimental results are shown using CPPC with different test applications.