With the evolution of high-performance computing, parallel applications increasingly need fault tolerance, which is most commonly provided by checkpoint and restart techniques. Checkpointing tools are typically implemented at one of two abstraction levels: the system level or the application level. The latter has become an attractive alternative because of its flexibility and its ability to operate in different environments. However, application-level checkpointing tools often require the user to insert checkpoints manually in order to ensure that certain requirements are met (e.g., forcing checkpoints to be taken in user code and not inside kernel routines). This paper examines the transformations required to enable automatic checkpointing of parallel applications in the CPPC application-level checkpointing framework. These transformations have been implemented on two very different compiler infrastructures: Cetus and LLVM. Cetus is a Java-based compiler infrastructure that aims to provide a clean, easy-to-use IR and API for program transformation. LLVM is a low-level, SSA-based toolchain. The fundamental differences between the two approaches are analyzed from the structural, behavioral, and performance perspectives.