Compiler-generated staggered checkpointing
LCR '04: Proceedings of the 7th Workshop on Languages, Compilers, and Run-Time Support for Scalable Systems
Checkpointing is a key technology for applications on large cluster computer systems. As cluster sizes grow, component failures will become a normal part of operation, and applications will have to deal more directly with repeated failures during program runs. In this paper, we describe automatic checkpointing in the ZPL compiler and its advantages over traditional library- or system-based approaches, which have no information about application behavior. We show that even naive compiler-inserted checkpoints can significantly reduce the size of the checkpoint recovery data, by up to 73% in our application suite. We also introduce the notion of checkpoint ranges: regions of code within which each processor can perform a local checkpoint at any point. The compiler guarantees that these local checkpoints form a globally consistent checkpoint without global coordination by ensuring that there are no in-flight messages during the checkpoint range. Checkpoint ranges further help alleviate the additional network congestion caused by checkpointing.
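The idea of a checkpoint range can be illustrated with a small toy model. The sketch below is not the ZPL compiler's analysis; it assumes a program flattened into a linear list of "compute", "send", and "recv" operations, and the names checkpoint_ranges and staggered_points are hypothetical. It finds the spans of operations during which no message is in flight, then spreads processors across one such span so that they do not all write their checkpoints at the same moment.

```python
from typing import List, Tuple


def checkpoint_ranges(ops: List[str]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of compute operations
    during which no message is in flight (toy model, assumed op names)."""
    ranges = []
    in_flight = 0   # messages sent but not yet received
    start = None    # start index of the currently open range, if any
    for i, op in enumerate(ops):
        if op == "send":
            # A send puts a message in flight, so any open range ends here.
            if start is not None:
                ranges.append((start, i - 1))
                start = None
            in_flight += 1
        elif op == "recv":
            in_flight -= 1
        elif in_flight == 0 and start is None:
            # A compute step with nothing in flight opens a new range.
            start = i
    if start is not None:
        ranges.append((start, len(ops) - 1))
    return ranges


def staggered_points(rng: Tuple[int, int], num_procs: int) -> List[int]:
    """Assign each processor a checkpoint position inside the range,
    cycling through the positions so the checkpoints are spread out."""
    start, end = rng
    width = end - start + 1
    return [start + p % width for p in range(num_procs)]


ops = ["compute", "send", "compute", "recv",
       "compute", "compute", "compute", "send", "recv", "compute"]
ranges = checkpoint_ranges(ops)
points = staggered_points(ranges[1], num_procs=4)
print(ranges)   # [(0, 0), (4, 6), (9, 9)]
print(points)   # [4, 5, 6, 4]
```

Because no message is outstanding anywhere inside a range, any combination of per-processor positions within it corresponds to a consistent global state in this model, which is why no coordination is needed to pick them.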