A flexible checkpoint/restart model in distributed systems

Authors:
Mohamed-Slim Bouguerra;Thierry Gautier;Denis Trystram;Jean-Marc Vincent
Affiliations:
Grenoble University, Montbonnot Saint Martin, France;INRIA Rhone-Alpes, Saint Ismier, France;Grenoble University, Montbonnot Saint Martin, France;Grenoble University, Montbonnot Saint Martin, France
Venue:
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Year:
2009

Abstract

Large-scale applications running on new computing platforms with thousands of processors face reliability problems: the failure of a single processor causes the entire execution to fail. Most existing approaches to guaranteeing reliable execution are based on fault-tolerance mechanisms, and coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. This work presents a new model of the coordinated checkpoint/restart mechanism for several types of computing platforms. The model is parametrized by the process failure distribution, the cost of saving a globally consistent state of the processes, and the number of computational resources. Through a mathematical analysis of reliability, we apply this model to compute the optimal interval between checkpoints so as to minimize the average completion time. Because the model does not depend on the type of failure law, it is completely flexible. We show that such a model may be used to reduce the checkpoint rate by up to 20% and the total overhead by up to a factor of 4 in some cases. Finally, we report experiments based on simulations with random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.
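
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the model from this paper. It uses Young's classical first-order approximation, T ≈ sqrt(2·C·M), where C is the checkpoint cost and M is the platform mean time between failures, as a stand-in for the optimal interval, and a small simulator to estimate the average completion time under exponential (Poisson process) and Weibull failures. The function names, the parameter values, the rule M = per-processor MTBF / number of processors, and the assumptions that the failure clock renews after each failure and that no failure occurs during a restart are all illustrative choices, not taken from the paper.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf_per_proc, n_procs):
    """Young's first-order approximation of the checkpoint interval.

    Assumes independent exponential failures, so the platform MTBF is
    the per-processor MTBF divided by the number of processors.
    """
    platform_mtbf = mtbf_per_proc / n_procs
    return math.sqrt(2.0 * checkpoint_cost * platform_mtbf)


def simulate_completion(work, interval, checkpoint_cost, restart_cost, draw_ttf, rng):
    """Simulate one run with periodic checkpoints and return its wall-clock time.

    draw_ttf(rng) returns the time until the next platform failure.
    Simplifications (assumptions): the failure clock renews after every
    failure, and no failure occurs during the restart itself.
    """
    elapsed = 0.0   # wall-clock time so far
    done = 0.0      # work safely saved by the last checkpoint
    next_failure = draw_ttf(rng)
    while done < work:
        segment = min(interval, work - done)
        # No checkpoint is taken after the final segment.
        need = segment + (checkpoint_cost if done + segment < work else 0.0)
        if elapsed + need <= next_failure:
            elapsed += need
            done += segment
        else:
            # Failure: lose the unsaved segment, pay the restart cost.
            elapsed = next_failure + restart_cost
            next_failure = elapsed + draw_ttf(rng)
    return elapsed


def mean_completion(runs, seed, **kwargs):
    """Average simulate_completion over several seeded runs."""
    rng = random.Random(seed)
    return sum(simulate_completion(rng=rng, **kwargs) for _ in range(runs)) / runs


if __name__ == "__main__":
    C, R, WORK = 5.0, 5.0, 5000.0          # hypothetical costs and job size
    MTBF_PROC, N = 100_000.0, 100          # hypothetical platform
    mtbf = MTBF_PROC / N
    shape = 0.7                            # hypothetical Weibull shape
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)  # same mean as exponential

    laws = {
        "exponential": lambda rng: rng.expovariate(1.0 / mtbf),
        "weibull": lambda rng: rng.weibullvariate(scale, shape),
    }
    t_young = young_interval(C, MTBF_PROC, N)
    for name, draw in laws.items():
        for interval in (t_young / 2, t_young, 2 * t_young):
            avg = mean_completion(200, 1, work=WORK, interval=interval,
                                  checkpoint_cost=C, restart_cost=R, draw_ttf=draw)
            print(f"{name:12s} interval={interval:7.1f} mean completion={avg:9.1f}")
```

Running the script prints the estimated mean completion time for three intervals under each failure law, which gives a rough feel for how sensitive the overhead is to the checkpoint interval and to the failure distribution.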