SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing

Authors:
Thomas Ropars;Tatiana V. Martsinkevich;Amina Guermouche;André Schiper;Franck Cappello
Affiliations:
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland;INRIA, University of Paris Sud, Paris, France;Université de Versailles, Versailles, France;École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland;Argonne National Laboratory, Argonne
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation between the events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, hierarchically combines coordinated checkpointing and message logging. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
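
The abstract names channel-determinism without defining it. As a rough illustration only, the sketch below assumes it means that the sequence of messages sent on each channel (sender, receiver pair) is the same in every valid execution, even if sends on different channels interleave differently. The trace format and the function names are hypothetical and are not taken from SPBC.

```python
from collections import defaultdict


def per_channel_sequences(trace):
    """Group a global send trace into per-channel message sequences.

    `trace` is a list of (sender, receiver, payload) tuples, in the order
    the sends occurred. This trace format is an assumption for illustration.
    """
    channels = defaultdict(list)
    for sender, receiver, payload in trace:
        channels[(sender, receiver)].append(payload)
    return dict(channels)


def same_channel_behavior(trace_a, trace_b):
    """Return True if both traces send the same sequence on every channel."""
    return per_channel_sequences(trace_a) == per_channel_sequences(trace_b)


# Two runs that differ only in how sends on different channels interleave.
run_1 = [(0, 1, "a"), (0, 2, "b"), (0, 1, "c")]
run_2 = [(0, 2, "b"), (0, 1, "a"), (0, 1, "c")]
# A run in which the order of messages on channel (0, 1) itself changes.
run_3 = [(0, 1, "c"), (0, 2, "b"), (0, 1, "a")]

print(same_channel_behavior(run_1, run_2))
print(same_channel_behavior(run_1, run_3))
```

Under this assumed definition, the first comparison holds and the second does not: reordering sends across channels is tolerated, while reordering sends within one channel is not.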