Current fault tolerance protocols are not sufficiently scalable for the exascale era. The most widely used method, coordinated checkpointing, places enormous demands on the I/O subsystem and imposes frequent synchronizations. Uncoordinated protocols use message logging, which introduces message rate limitations or undesired memory and storage requirements to hold payload and event logs. In this paper, we propose a combination of several techniques, namely coordinated checkpointing, optimistic message logging, and a protocol that glues them together. This combination eliminates some of the drawbacks of each individual approach and proves to be an alternative for many types of exascale applications. We evaluate the performance and scaling characteristics of this combination using simulation and a partial implementation. While not a universal solution, the combined protocol is suitable for a large range of existing and future applications that use coordinated checkpointing, and it enhances their scalability.