Compiler-Enhanced Incremental Checkpointing

Authors:
Greg Bronevetsky; Daniel Marques; Keshav Pingali; Radu Rugina
Affiliations:
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94551, USA; Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712, USA; Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712, USA; Department of Computer Science, Cornell University, Ithaca, NY 14850, USA
Venue:
Languages and Compilers for Parallel Computing
Year:
2007

Abstract

As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity, which makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures: applications periodically save their state and, after a failure, restart the computation from the most recently saved state. Although a variety of automated system-level checkpointing solutions are available to HPC users, manual application-level checkpointing remains by far the most popular approach because of its superior performance. This paper focuses on improving the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis is shown to significantly reduce checkpoint sizes (by up to 78%) and to enable asynchronous checkpointing.
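
The general idea of incremental checkpointing mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's analysis or runtime: the class `IncrementalCheckpointer` and its methods `write`, `checkpoint`, and `restore` are hypothetical names chosen for illustration, and the dirty marking that a compiler-based scheme would insert automatically is done by hand here. The sketch only shows why a checkpoint that records modified state alone can be smaller than a full one.

```python
import copy


class IncrementalCheckpointer:
    """Toy runtime (hypothetical): saves only variables modified since the last checkpoint."""

    def __init__(self):
        self.state = {}        # live application state: name -> value
        self.dirty = set()     # names written since the last checkpoint
        self.checkpoints = []  # each checkpoint holds only the dirty entries

    def write(self, name, value):
        # Assumption: in a compiler-based scheme, marking like this would be
        # inserted automatically; here the caller invokes it explicitly.
        self.state[name] = value
        self.dirty.add(name)

    def checkpoint(self):
        # Copy only what changed, then reset the dirty set.
        delta = {name: copy.deepcopy(self.state[name]) for name in self.dirty}
        self.checkpoints.append(delta)
        self.dirty.clear()
        return delta

    def restore(self):
        # Replay the deltas in order; later checkpoints overwrite earlier values.
        restored = {}
        for delta in self.checkpoints:
            restored.update(copy.deepcopy(delta))
        return restored


ckpt = IncrementalCheckpointer()
ckpt.write("grid", [0.0] * 4)
ckpt.write("step", 0)
first = ckpt.checkpoint()      # full state: both variables are new

ckpt.write("step", 1)          # "grid" is left untouched
second = ckpt.checkpoint()     # contains only "step"
```

In this sketch the second checkpoint omits `grid` because nothing wrote to it after the first checkpoint, and `restore` rebuilds the full state by replaying the saved deltas in order.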