Experimental evaluation of application-level checkpointing for OpenMP programs

Authors:
Greg Bronevetsky;Keshav Pingali;Paul Stodghill
Affiliations:
Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY
Venue:
Proceedings of the 20th annual international conference on Supercomputing
Year:
2006

Citing 14
Cited 5

A checkpoint protocol for an entry consistent shared memory system

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Distributed snapshots: determining global states of distributed systems

ACM Transactions on Computer Systems (TOCS)
Application level fault tolerance in heterogeneous networks of workstations

Journal of Parallel and Distributed Computing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
CoCheck: Checkpointing and Process Migration for MPI

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Automated application-level checkpointing of MPI programs

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Collective operations in application-level fault-tolerant MPI

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
A User-level Checkpointing Library for POSIX Threads Programs

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Coherence-Based Coordinated Checkpointing for Software Distributed Shared Memory Systems

ICDCS '00 Proceedings of the The 20th International Conference on Distributed Computing Systems ( ICDCS 2000)
Application-level checkpointing for shared memory programs

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Mobile MPI programs in computational grids

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Cooperative checkpointing: a robust approach to large-scale systems reliability

Proceedings of the 20th annual international conference on Supercomputing

Algorithm-based recovery for iterative methods without checkpointing

Proceedings of the 20th international symposium on High performance distributed computing
Using dynamic task level redundancy for OpenMP fault tolerance

ARCS'12 Proceedings of the 25th international conference on Architecture of Computing Systems
Fault tolerant preconditioned conjugate gradient for sparse linear system solution

Proceedings of the 26th ACM international conference on Supercomputing
McrEngine: a scalable checkpointing system using data-aware aggregation and compression

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
McrEngine: A scalable checkpointing system using data-aware aggregation and compression

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.01

Visualization

Abstract

It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR) - the computation's state is saved periodically to disk. Upon failure the computation is restarted from the last saved state. The common CPR mechanism, called System-level Checkpointing (SLC), requires modifying the Operating System and the communication libraries to enable them to save the state of the entire parallel application. This approach is not portable since a checkpointer for one system rarely works on another. Application-level Checkpointing (ALC) is a portable alternative where the programmer manually modifies their program to enable CPR, a very labor-intensive task.We are investigating the use of compiler technology to instrument codes to embed the ability to tolerate faults into applications themselves, making them self-checkpointing and self-restarting on any platform. In [9] we described a general approach for checkpointing shared memory APIs at the application level. Since [9] applied to only a toy feature set common to most shared memory APIs, this paper shows the practicality of this approach by extending it to a specific popular shared memory API: OpenMP. We describe the challenges involved in providing automated ALC for OpenMP applications and experimentally validate this approach by showing detailed performance results for our implementation of this technique. Our experiments with the NAS OpenMP benchmarks [1] and the EPCC microbench-marks [21] show generally low overhead on three different architectures: Linux/IA64, Tru64/Alpha and Solaris/Sparc and highlight important lessons about the performance characteristics of this aproach.