Application-level checkpointing for shared memory programs

Authors:
Greg Bronevetsky;Daniel Marques;Keshav Pingali;Peter Szwed;Martin Schulz
Affiliations:
Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;University of California, Livermore, CA
Venue:
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Year:
2004

Abstract

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
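
To make the checkpoint-and-restart idea concrete, the following is a minimal single-threaded sketch in Python of an application that saves its own state periodically and resumes from the last saved state after a failure. It is an illustration only, not the system described in the paper: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a fault, and the use of `pickle` are all assumptions made for this example, and the sketch does not cover the protocol that coordinates checkpoints among threads.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save state atomically: write a temporary file, then rename it over path.

    A crash during the write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical, for illustration) simulates a hardware fault by
    raising an exception just before that step executes.
    """
    # Restart from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If `run(path, 10, 3, fail_at=7)` is interrupted by the simulated fault, a later call to `run(path, 10, 3)` resumes from the checkpoint taken after step 6 and repeats only the work done since then. In this sketch the save and restore logic lives inside the application code rather than in the operating system, which is the sense in which the abstract calls an application self-checkpointing and self-restarting.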