Transparent optimistic rollback recovery
ACM SIGOPS Operating Systems Review
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Application level fault tolerance in heterogeneous networks of workstations
Journal of Parallel and Distributed Computing
On scalable and efficient distributed failure detectors
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
A network-failure-tolerant message-passing system for terascale clusters
ICS '02 Proceedings of the 16th international conference on Supercomputing
Distributed Algorithms
An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Collective operations in application-level fault-tolerant MPI
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Portable Checkpointing for Heterogeneous Archtitectures
FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Egida: An Extensible Toolkit For Low-Overhead Fault-Tolerance
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
MPI: A Message-Passing Interface Standard
MPI: A Message-Passing Interface Standard
Libckpt: Transparent Checkpointing under Unix
Libckpt: Transparent Checkpointing under Unix
Compiler-Assisted Checkpointing
Compiler-Assisted Checkpointing
Collective operations in application-level fault-tolerant MPI
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Adaptive incremental checkpointing for massively parallel systems
Proceedings of the 18th annual international conference on Supercomputing
Application-level checkpointing for shared memory programs
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Checkpointing-based rollback recovery for parallel applications on the InteGrade grid middleware
MGC '04 Proceedings of the 2nd workshop on Middleware for grid computing
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 1 - Volume 02
Optimizing Checkpoint Sizes in the C3 System
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Compiler-generated staggered checkpointing
LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Mobile MPI programs in computational grids
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
International Journal of High Performance Computing Applications
Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++
ACM SIGOPS Operating Systems Review
Stabilizers: a modular checkpointing abstraction for concurrent functional programs
Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
Implementing fault-tolerance in real-time systems by automatic program transformations
EMSOFT '06 Proceedings of the 6th ACM & IEEE International conference on Embedded software
Experimental evaluation of application-level checkpointing for OpenMP programs
Proceedings of the 20th annual international conference on Supercomputing
Scalable, fault tolerant membership for MPI tasks on HPC systems
Proceedings of the 20th annual international conference on Supercomputing
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Quasi-atomic recovery for distributed agents
Parallel Computing
Modular Checkpointing for Atomicity
Electronic Notes in Theoretical Computer Science (ENTCS)
Compensation of Measurement Overhead in Parallel Performance Profiling
International Journal of High Performance Computing Applications
HySim: a fast simulation framework for embedded software development
CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
A fast and generic hybrid simulation approach using C virtual machine
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Automated application-level checkpointing based on live-variable analysis in MPI programs
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Fault-tolerant stream processing using a distributed, replicated file system
Proceedings of the VLDB Endowment
Interconnect agnostic checkpoint/restart in open MPI
Proceedings of the 18th ACM international symposium on High performance distributed computing
A proposal for error handling in OpenMP
International Journal of Parallel Programming
A fault-tolerant strategy for virtualized HPC clusters
The Journal of Supercomputing
A novel fault-tolerant parallel algorithm
APPT'07 Proceedings of the 7th international conference on Advanced parallel processing technologies
Lightweight checkpointing for concurrent ml
Journal of Functional Programming
A proposal for error handling in OpenMP
IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Recent advances in checkpoint/recovery systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
BRRL: a recovery library for main-memory applications in the cloud
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Algorithm-based recovery for iterative methods without checkpointing
Proceedings of the 20th international symposium on High performance distributed computing
An effective speedup metric for measuring productivity in large-scale parallel computer systems
The Journal of Supercomputing
A technique for non-invasive application-level checkpointing
The Journal of Supercomputing
New user-guided and ckpt-based checkpointing libraries for parallel MPI applications,
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Models for on-the-fly compensation of measurement overhead in parallel performance profiling
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
A hybrid message Logging-CIC protocol for constrained checkpointability
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Execution migration in a heterogeneous-ISA chip multiprocessor
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Application-Level checkpointing techniques for parallel programs
ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology
Data-driven fault tolerance for work stealing computations
Proceedings of the 26th ACM international conference on Supercomputing
Programming model support for dependable, elastic cloud applications
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
enhancing fault-tolerance of large-scale MPI scientific applications
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hi-index | 0.00 |
The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures.In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach.We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.