Compiler-generated staggered checkpointing

Authors:
Alison N. Norman;Sung-Eun Choi;Calvin Lin
Affiliations:
The University of Texas at Austin;Los Alamos National Laboratory;The University of Texas at Austin
Venue:
LCR '04 Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems
Year:
2004

Abstract

To minimize work lost due to system failures, large parallel applications perform periodic checkpoints. These checkpoints are typically inserted manually by application programmers, resulting in synchronous checkpoints, that is, checkpoints that occur at the same program point in all processes. While this solution is tenable for current systems, it will become problematic for future supercomputers that have many tens of thousands of nodes, because contention for both the network and file system will grow. This paper shows that staggered checkpoints (globally consistent checkpoints in which processes perform checkpoints at different points in the code) can significantly reduce network and file system contention. We describe a compiler-based approach for inserting staggered checkpoints, and we show, using trace-driven simulation, that staggered checkpointing is 23 times faster than synchronous checkpointing.
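
The contention argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's trace-driven simulation and makes no attempt to reproduce its 23x figure. It only counts how many checkpoint writes are in flight at once when every process starts writing at the same moment versus at staggered offsets. The process count, write duration, and stagger interval are hypothetical values chosen for illustration, and the model ignores the global-consistency requirement that real staggered checkpoints must satisfy.

```python
def peak_concurrent_writers(start_times, write_duration):
    """Return the largest number of checkpoint writes in flight at once."""
    events = []
    for start in start_times:
        events.append((start, 1))                    # a write begins
        events.append((start + write_duration, -1))  # a write ends
    # Sort by time; at equal times an end (-1) sorts before a start (+1),
    # so back-to-back writes are not counted as overlapping.
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical parameters, chosen only for illustration.
num_processes = 8
write_duration = 2.0  # seconds each process spends writing its checkpoint
stagger = 1.0         # seconds between successive checkpoint starts

# Synchronous: every process checkpoints at the same program point.
synchronous = [0.0] * num_processes
# Staggered: each process checkpoints at a different point in time.
staggered = [i * stagger for i in range(num_processes)]

print(peak_concurrent_writers(synchronous, write_duration))  # 8
print(peak_concurrent_writers(staggered, write_duration))    # 2
```

In this toy setting the synchronous schedule puts all eight writers on the network and file system at once, while the staggered schedule never has more than two. How much that reduces actual checkpoint time depends on how the shared resources degrade under load, which this sketch does not model.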