Multi-criteria checkpointing strategies: response-time versus resource utilization

Authors:
Aurelien Bouteiller;Franck Cappello;Jack Dongarra;Amina Guermouche;Thomas Hérault;Yves Robert
Affiliations:
University of Tennessee Knoxville;University of Illinois at Urbana Champaign and INRIA, France;University of Tennessee Knoxville;Univ. Versailles St Quentin, France;University of Tennessee Knoxville;University of Tennessee Knoxville and Ecole Normale Supérieure de Lyon, France
Venue:
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Year:
2013

Abstract

Failures increasingly threaten the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery in which the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while the strategy delivers near-perfect platform efficiency.