Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
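
To make the idea of application-level CPR for a threaded program concrete, the following is a minimal Python sketch, not the system described here (which targets OpenMP codes through a pre-compiler and a runtime library). It assumes a toy computation in which all threads meet at a barrier after every step, so a checkpoint can be taken there by a single thread while the others wait. The function name `run`, its parameters, the use of `pickle` for saving state, and the simulated fault are all hypothetical choices made for illustration; checkpointing in the presence of locks is not covered.

```python
import os
import pickle
import tempfile
import threading


def run(path, n_threads=4, n_steps=10, every=3, crash_at=None):
    """Toy barrier-coordinated checkpoint/restart (illustrative only)."""
    # Restart path: resume from the last saved state if a checkpoint exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            start, state = pickle.load(f)
    else:
        start, state = 0, [0] * n_threads

    barrier = threading.Barrier(n_threads)

    def worker(tid):
        for step in range(start, n_steps):
            if step == crash_at:
                return  # simulated fault: every thread stops at the same step
            state[tid] += (tid + 1) * step  # stand-in for the real computation
            # First barrier: all threads have finished this step. Exactly one
            # thread receives index 0 and writes the checkpoint when one is due.
            if barrier.wait() == 0 and (step + 1) % every == 0:
                tmp = path + ".tmp"
                with open(tmp, "wb") as f:
                    pickle.dump((step + 1, state), f)
                os.replace(tmp, path)  # rename so a partial file is never read
            # Second barrier: no thread modifies state while it is being saved.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # A simulated fault loses the in-memory state; only the file survives.
    crashed = crash_at is not None and start <= crash_at < n_steps
    return None if crashed else state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "demo.ckpt")
        run(ckpt, crash_at=7)  # fails after the checkpoint taken at step 6
        print(run(ckpt))       # restarts from step 6 and runs to completion
```

In this sketch the second call re-executes the steps that followed the last checkpoint and ends with the same state as an uninterrupted run. Because the restart logic lives in the program itself rather than in the operating system, the same code behaves identically on any platform with a Python interpreter, which mirrors the portability argument made above.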