To minimize the expected execution time, a general checkpoint scheduling algorithm is proposed that determines a near-optimal sequence of checkpoint times. More precisely, based on a simple timing policy, an analytical model of the execution is introduced and the expected effective ratio is derived. Maximizing the expected effective ratio yields the optimal checkpoint period directly for the exponential failure distribution, and a general checkpoint scheduling algorithm is developed to compute a near-optimal sequence of checkpoint times for an arbitrary failure distribution. Experimental results show that the proposed algorithm varies the checkpoint interval according to the failure distribution and that, in terms of reliability, the expected effective ratio achieved for long-running applications is considerable.
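The idea of choosing a checkpoint period by maximizing an effective ratio can be illustrated with a minimal sketch. It does not reproduce the paper's model, which the abstract does not spell out. It assumes a standard textbook setting instead: failures arrive exponentially with rate `lam`, each segment consists of `tau` seconds of useful work followed by a checkpoint costing `C` seconds, and every failure adds a failure-free restart delay `R` before the segment is retried. Under those assumptions the expected time to complete one segment is (1/lam + R)(e^{lam(tau + C)} - 1), and the effective ratio is taken here to be `tau` divided by that expectation. All function and parameter names are hypothetical.

```python
import math


def effective_ratio(tau, lam, C, R=0.0):
    """Useful work divided by expected wall-clock time for one segment.

    Assumed model (not taken from the paper): exponential failures with
    rate lam, checkpoint cost C, failure-free restart delay R.
    """
    expected_segment_time = (1.0 / lam + R) * math.expm1(lam * (tau + C))
    return tau / expected_segment_time


def best_period(lam, C, R=0.0, lo=1e-6, hi=None, iters=200):
    """Golden-section search for the tau that maximizes effective_ratio.

    The ratio is unimodal in tau under the assumed model, so a bracketing
    search is enough; the maximizer lies below 1/lam.
    """
    if hi is None:
        hi = 10.0 / lam  # generous upper bracket
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if effective_ratio(c, lam, C, R) < effective_ratio(d, lam, C, R):
            a = c  # maximum lies to the right of c
        else:
            b = d  # maximum lies to the left of d
    return (a + b) / 2.0


def first_order_period(lam, C):
    """Classical first-order approximation sqrt(2 * C / lam), for comparison."""
    return math.sqrt(2.0 * C / lam)
```

For example, with a mean time between failures of one day (`lam = 1 / 86400`) and a five-minute checkpoint (`C = 300`), `best_period(lam, C)` returns a period close to, and slightly below, the first-order estimate from `first_order_period(lam, C)`. In this simplified model the restart delay `R` scales the ratio but does not move the optimum. A non-exponential failure distribution would generally call for a non-uniform sequence of checkpoint times, which is the case the general algorithm in the abstract addresses.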