Optimizing Checkpoint Sizes in the C3 System

  • Authors:
  • Daniel Marques;Greg Bronevetsky;Rohit Fernandes;Keshav Pingali;Paul Stodghil

  • Affiliations:
  • Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY

  • Venue:
  • IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
  • Year:
  • 2005

Quantified Score

Hi-index 0.01

Visualization

Abstract

The running times of many computational science applications are much longer than the mean-time-between-failures (MTBF) of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this - the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform amd cannot optimize the checkpointing process using application-specific knowledge. We are exploring an alternative called automatic applicationlevel checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform. In this paper, we evaluate a mechanism that utilizes application knowledge to minimize the amount of information saved in a checkpoint.