IEEE Transactions on Parallel and Distributed Systems
Checkpointing is an effective measure to ensure the completion of long-running jobs in Desktop Grids, which are subject to frequent resource failures. We focus on checkpointing strategies in the context of Desktop Grids, including volunteer computing systems, where individual hosts follow diverse failure distributions. We propose an algorithm that computes a sequence of checkpoint interval lengths for each individual host from a sample of its availability interval lengths. The algorithm approximates the probability distribution of availability interval lengths directly with the sample, without deriving a closed form of the distribution. In simulations with synthetic trace data and with trace data from a real volunteer computing project, this sample-based strategy wastes less time than a periodic strategy in most cases.
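The idea of using the sample itself as the distribution can be illustrated with a small sketch. This is not the algorithm proposed in the paper, whose details are not given in the abstract; it is a simple greedy heuristic written under stated assumptions. The function names, the fixed `checkpoint_cost`, the finite list of `candidates`, and the selection criterion (expected saved work per unit of elapsed time) are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def survival(sample: Sequence[float], t: float) -> float:
    """Empirical P(availability length >= t): fraction of sample values >= t."""
    return sum(1 for a in sample if a >= t) / len(sample)


def greedy_checkpoint_intervals(
    sample: Sequence[float],
    checkpoint_cost: float,
    candidates: Sequence[float],
    max_checkpoints: int = 10,
) -> List[float]:
    """Pick checkpoint intervals one at a time from `candidates`.

    Illustrative heuristic only (an assumption, not the paper's method):
    at each step, choose the interval tau that maximizes
        tau * P(host survives tau + cost | host survived so far) / (tau + cost),
    where the probability is read directly off the sample.
    """
    intervals: List[float] = []
    elapsed = 0.0
    for _ in range(max_checkpoints):
        alive = survival(sample, elapsed)
        if alive == 0.0:
            break  # no sampled availability interval lasts this long
        best_tau, best_rate = None, 0.0
        for tau in candidates:
            # Conditional probability that the checkpoint completes.
            p_done = survival(sample, elapsed + tau + checkpoint_cost) / alive
            rate = tau * p_done / (tau + checkpoint_cost)
            if rate > best_rate:
                best_tau, best_rate = tau, rate
        if best_tau is None:
            break  # every candidate would outlast all sampled intervals
        intervals.append(best_tau)
        elapsed += best_tau + checkpoint_cost
    return intervals
```

Because the probabilities are computed by counting sample values, no distribution is ever fitted, and the resulting interval lengths can vary from one checkpoint to the next, unlike a periodic strategy.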