Checkpointing strategies for parallel jobs

Authors:
Marin Bougeret; Henri Casanova; Mikael Rabie; Yves Robert; Frédéric Vivien
Affiliations:
ENS Lyon, France; Univ. of Hawai'i at Mānoa, Honolulu; ENS Lyon, France; ENS Lyon, France; INRIA, Lyon, France
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing overhead. We first perform extensive simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems. The obtained results not only corroborate our theoretical findings, but also show that our dynamic programming algorithm significantly outperforms previously proposed solutions in the case of Weibull failures. We then discuss results from simulation experiments that use failure logs from production clusters. These results confirm that our dynamic programming algorithm significantly outperforms existing solutions for real-world clusters.
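
To make the periodic-checkpointing setting concrete, the following is a minimal simulation sketch and not the authors' algorithm. It assumes a sequential job split into equal chunks, each followed by a checkpoint, with exponentially distributed failure inter-arrival times, a fixed downtime during which no failure occurs, and a recovery step that can itself be interrupted. All function and parameter names (`simulate_makespan`, `mean_makespan`, `young_period`) are hypothetical, and `young_period` is the classical first-order approximation sqrt(2 * C * MTBF), included only as a familiar reference point rather than as the optimal strategy derived in this work.

```python
import math
import random


def young_period(checkpoint_cost, mtbf):
    """Classical first-order approximation of the checkpoint period."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_makespan(work, n_chunks, checkpoint_cost, downtime, recovery, mtbf, rng):
    """Simulate one execution with n_chunks equal chunks, each followed by a checkpoint.

    Assumptions (illustrative only): failures are exponential with mean `mtbf`,
    no failure strikes during downtime, and a failure loses the current chunk.
    """
    chunk = work / n_chunks
    clock = 0.0
    next_failure = rng.expovariate(1.0 / mtbf)
    done = 0
    need_recovery = False
    while done < n_chunks:
        # After a failure, the recovery must be redone before the chunk itself.
        attempt = chunk + checkpoint_cost + (recovery if need_recovery else 0.0)
        if clock + attempt <= next_failure:
            clock += attempt
            done += 1
            need_recovery = False
        else:
            # Failure: pay the downtime, then draw the next failure date.
            clock = next_failure + downtime
            next_failure = clock + rng.expovariate(1.0 / mtbf)
            need_recovery = True
    return clock


def mean_makespan(work, n_chunks, checkpoint_cost, downtime, recovery, mtbf,
                  trials=2000, seed=0):
    """Average the simulated makespan over `trials` independent runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_makespan(work, n_chunks, checkpoint_cost, downtime, recovery, mtbf, rng)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # Hypothetical parameters, in seconds: 10 h of work, 10 min checkpoints, 1 day MTBF.
    work, c, d, r, mtbf = 36000.0, 600.0, 60.0, 600.0, 86400.0
    print("first-order period estimate:", young_period(c, mtbf))
    for k in (1, 2, 4, 8, 16):
        print(k, "chunks ->", mean_makespan(work, k, c, d, r, mtbf))
```

Running the script compares the average makespan for a few chunk counts under these assumed parameters. Swapping `rng.expovariate` for `rng.weibullvariate` would give a rough way to explore non-exponential failures, although the resampling after each failure would then no longer be justified by memorylessness.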