Checkpoint and Recovery (CPR) systems have many uses in high-performance computing, and many developers have implemented them by hand in their applications. One use of checkpointing is to mitigate the effects of interruptions in computational service, both planned and unplanned. In fact, some supercomputing centers expect their users to checkpoint as a matter of policy, and yet few centers provide fully automatic checkpointing systems for their high-end production machines. This paper is a status report on our work on the C3 family of systems for (almost) fully automatic checkpointing of scientific applications. To date, we have shown that our techniques can be used for checkpointing sequential, MPI, and OpenMP applications written in C, Fortran, and several other languages. A novel aspect of our work is that we have not built a single checkpointing system; rather, we have developed a methodology and a set of techniques that have enabled us to develop a number of systems, each meeting different design goals and efficiency requirements.