Alleviating scalability issues of checkpointing protocols

Authors:
Rolf Riesen;Kurt Ferreira;Dilma Da Silva;Pierre Lemarinier;Dorian Arnold;Patrick G. Bridges
Affiliations:
IBM Research - Ireland;Sandia National Laboratories, Albuquerque;IBM Research - Watson;IBM Research - Ireland;University of New Mexico, Albuquerque;University of New Mexico, Albuquerque
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Citing 26
Cited 2

Recovery in distributed systems using asynchronous message logging and checkpointing

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Fast parallel algorithms for short-range molecular dynamics

Journal of Computational Physics
Distributed snapshots: determining global states of distributed systems

ACM Transactions on Computer Systems (TOCS)
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

IEEE Transactions on Computers
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Predictive performance and scalability modeling of a large-scale application

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Message Logging: Pessimistic, Optimistic, Causal, and Optimal

IEEE Transactions on Software Engineering
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Parallel Checkpoint/Restart without Message Logging

ICPP '00 Proceedings of the 2000 International Workshop on Parallel Processing
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Optimistic replication

ACM Computing Surveys (CSUR)
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Coordinated checkpoint versus message log for fault tolerant MPI

International Journal of High Performance Computing and Networking
An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Journal of Parallel and Distributed Computing
An analysis of clustered failures on large supercomputing systems

Journal of Parallel and Distributed Computing
Interconnect agnostic checkpoint/restart in open MPI

Proceedings of the 18th ACM international symposium on High performance distributed computing
Active Optimistic Message Logging for Reliable Execution of MPI Applications

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A higher order estimate of the optimum checkpoint interval for restart dumps

Future Generation Computer Systems
Redesigning the message logging model for high performance

Concurrency and Computation: Practice & Experience - International Supercomputing Conference
Dodging the cost of unavoidable memory copies in message logging protocols

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Recent advances in checkpoint/recovery systems

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Correlated set coordination in fault tolerant message logging protocols

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications

IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Evaluating the viability of process replication reliability for exascale systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Replication for send-deterministic MPI HPC applications

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

Current fault tolerance protocols are not sufficiently scalable for the exascale era. The most-widely used method, coordinated checkpointing, places enormous demands on the I/O subsystem and imposes frequent synchronizations. Uncoordinated protocols use message logging which introduces message rate limitations or undesired memory and storage requirements to hold payload and event logs. In this paper we propose a combination of several techniques, namely coordinated checkpointing, optimistic message logging, and a protocol that glues them together. This combination eliminates some of the drawbacks of each individual approach and proves to be an alternative for many types of exascale applications. We evaluate performance and scaling characteristics of this combination using simulation and a partial implementation. While not a universal solution, the combined protocol is suitable for a large range of existing and future applications that use coordinated checkpointing and enhances their scalability.