Direct bulk-synchronous parallel algorithms
Journal of Parallel and Distributed Computing
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Application level fault tolerance in heterogeneous networks of workstations
Journal of Parallel and Distributed Computing
Cache-conscious structure layout
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Memory exclusion: optimizing the performance of checkpointing systems
Software—Practice & Experience
CLIP: a checkpointing tool for message-passing parallel programs
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Distributed Algorithms
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Automated application-level checkpointing of MPI programs
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Collective operations in application-level fault-tolerant MPI
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Portable Checkpointing for Heterogeneous Archtitectures
FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Libckpt: Transparent Checkpointing under Unix
Libckpt: Transparent Checkpointing under Unix
Compiler-Assisted Checkpointing
Compiler-Assisted Checkpointing
Application-level checkpointing for shared memory programs
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Finding your cronies: static analysis for dynamic object colocation
OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Remus: high availability via asynchronous virtual machine replication
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Recent advances in checkpoint/recovery systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Hi-index | 0.01 |
The running times of many computational science applications are much longer than the mean-time-between-failures (MTBF) of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this - the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform amd cannot optimize the checkpointing process using application-specific knowledge. We are exploring an alternative called automatic applicationlevel checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform. In this paper, we evaluate a mechanism that utilizes application knowledge to minimize the amount of information saved in a checkpoint.