Compiler-Enhanced Incremental Checkpointing

  • Authors:
  • Greg Bronevetsky; Daniel Marques; Keshav Pingali; Radu Rugina

  • Affiliations:
  • Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94551, USA; Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712, USA; Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712, USA; Department of Computer Science, Cornell University, Ithaca, NY 14850, USA

  • Venue:
  • Languages and Compilers for Parallel Computing
  • Year:
  • 2007

Abstract

As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity, which makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures: applications periodically save their state and, after a failure, restart the computation from the most recent saved state. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains by far the most popular approach because of its superior performance. This paper focuses on improving the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis is shown to significantly reduce checkpoint sizes (by up to 78%) and to enable asynchronous checkpointing.
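
The general idea behind incremental checkpointing can be illustrated with a minimal sketch. It is not the compiler analysis described in the paper: here, writes are tracked at run time through a hypothetical `write` method, whereas the paper determines statically what must be saved. The class name `IncrementalCheckpointer`, its methods, and the in-memory `saved` dictionary standing in for stable storage are all assumptions made for illustration.

```python
import copy


class IncrementalCheckpointer:
    """Toy model: save only variables modified since the last checkpoint."""

    def __init__(self):
        self.state = {}     # live application state, keyed by variable name
        self.dirty = set()  # names written since the last checkpoint
        self.saved = {}     # stand-in for stable storage (assumption)

    def write(self, name, value):
        """Update a variable and record that it must be saved again."""
        self.state[name] = value
        self.dirty.add(name)

    def checkpoint(self):
        """Copy only the dirty variables to storage; return their names."""
        saved_now = sorted(self.dirty)
        for name in saved_now:
            self.saved[name] = copy.deepcopy(self.state[name])
        self.dirty.clear()
        return saved_now

    def restore(self):
        """Roll the live state back to the most recent checkpoint."""
        self.state = copy.deepcopy(self.saved)
        self.dirty.clear()


ck = IncrementalCheckpointer()
ck.write("grid", [0.0] * 4)
ck.write("step", 0)
first = ck.checkpoint()   # full save: both variables are dirty
ck.write("step", 1)
second = ck.checkpoint()  # incremental save: only "step" changed
```

In this sketch the second checkpoint skips `grid` because it has not changed since the first one, which is the source of the size reduction that incremental schemes aim for.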