Recent advances in checkpoint/recovery systems

  • Authors:
  • Greg Bronevetsky;Rohit Fernandes;Daniel Marques;Keshav Pingali;Paul Stodghill

  • Affiliations:
  • Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY;Cornell University, Department of Computer Science, Ithaca, NY

  • Venue:
  • IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
  • Year:
  • 2006

Abstract

Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented checkpointing, by hand, in their applications. One use of checkpointing is to help mitigate the effects of interruptions in computational service, both planned and unplanned. In fact, some supercomputing centers expect their users to use checkpointing as a matter of policy. And yet, few centers provide fully automatic checkpointing systems for their high-end production machines. This paper is a status report on our work on the family of C3 systems for (almost) fully automatic checkpointing of scientific applications. To date, we have shown that our techniques can be used for checkpointing sequential, MPI and OpenMP applications written in C, Fortran, and several other languages. A novel aspect of our work is that we have not built a single checkpointing system; rather, we have developed a methodology and a set of techniques that have enabled us to develop a number of systems, each meeting different design goals and efficiency requirements.