Automatic application-level checkpointing for high performance computing systems

  • Authors:
  • K. Pingali;Daniel J. Marques

  • Affiliations:
  • Cornell University;Cornell University

  • Venue:
  • Automatic application-level checkpointing for high performance computing systems
  • Year:
  • 2006

Abstract

In high-performance computational science, changes in the computational infrastructure are leading to higher failure rates, so it is increasingly important for production systems to be fault-tolerant. On high-performance computing (HPC) systems, fault tolerance is typically provided by a checkpoint and restart (CPR) mechanism. We believe that neither of the two existing styles of CPR, system-level and application-level checkpointing, is well suited to current and future generations of HPC systems. System-level checkpointing (SLC) mechanisms, which are implemented either as part of the operating system or in a user-space library, are very easy to use but are not available on all platforms, and developing one or porting one to a new platform is a painstaking task. The developers of SLC mechanisms do not provide any guarantees that their systems will work with even slightly different versions of a platform's compiler, linker, or C library. Application-level checkpointing (ALC) mechanisms are implemented directly in the source code of the application, so the checkpointing code works anywhere the application can run. However, implementing ALC requires the application programmer to devote a huge effort to both programming and debugging the checkpointing routines.

This dissertation proposes Automatic Application-Level Checkpointing (AALC) as a way to combine the works-everywhere flexibility of ALC with the ease of use of SLC. There is a spectrum of solutions that could be described as AALC: the most appropriate solution for a particular domain depends on the application programmer's needs and on the limitations imposed by the machines on which the code will run. In this dissertation we examine two points along that spectrum, that is, two different approaches to AALC: logged execution and managed execution.
We will show that each approach has overheads similar to those of a widely used and well-supported SLC mechanism, and discuss how the two approaches were implemented and how they differ from each other. We will also discuss how AALC could be extended to optimize checkpoint size and reduce the overhead of checkpointing.
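
To make the ALC burden described above concrete, the following is a minimal sketch of hand-written application-level checkpointing. It is not taken from the dissertation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` test hook, and the use of Python's `pickle` are all illustrative assumptions. It shows only the general pattern in which the programmer decides what state to save, when to save it, and how to resume from it.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic on the same filesystem, so a crash during the
    # write never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    `fail_at` is a hypothetical hook that simulates a crash at the start
    of the given step; it exists only for demonstration.
    """
    # The programmer must know exactly which variables make up the state.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Even in this toy case the programmer has to identify the live state by hand, choose safe points at which to save it, and write the restart path. That manual work is what an automatic approach would aim to take over.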