On the optimum checkpoint selection problem
SIAM Journal on Computing
Calculating Cumulative Operational Time Distributions of Repairable Computer Systems
IEEE Transactions on Computers - The MIT Press scientific computation series
Optimal checkpointing of real-time tasks
IEEE Transactions on Computers
Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems
IEEE Transactions on Computers - Fault-Tolerant Computing
Optimum checkpoints with age dependent failures
Acta Informatica
Comparative Analysis of Different Models of Checkpointing and Recovery
IEEE Transactions on Software Engineering
On the Optimal Checkpointing of Critical Tasks and Transaction-Oriented Systems
IEEE Transactions on Software Engineering
A stochastic checkpoint optimization problem
SIAM Journal on Computing
IEEE Transactions on Parallel and Distributed Systems
Processor Shadowing: Maximizing Expected Throughput in Fault-Tolerant Systems
Mathematics of Operations Research
Performance analysis of checkpointing strategies
ACM Transactions on Computer Systems (TOCS)
Optimization criteria for checkpoint placement
Communications of the ACM
A first order approximation to the optimum checkpoint interval
Communications of the ACM
Stochastic analysis of distributed deadlock scheduling
Proceedings of the twenty-fourth annual ACM symposium on Principles of distributed computing
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle
IEEE Transactions on Dependable and Secure Computing
On Optimal Deadlock Detection Scheduling
IEEE Transactions on Computers
Failure-aware checkpointing in fine-grained cycle sharing systems
Proceedings of the 16th international symposium on High performance distributed computing
Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems
International Journal of Parallel, Emergent and Distributed Systems
Numerical computation algorithms for sequential checkpoint placement
Performance Evaluation
Proceedings of the 2009 workshop on Resiliency in high performance
Optimal checkpointing interval for two-level recovery schemes
Computers & Mathematics with Applications
A model for predicting the optimum checkpoint interval for restart dumps
ICCS'03 Proceedings of the 2003 international conference on Computational science
On checkpoint overhead in distributed systems providing session guarantees
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Analysis of a software system with rejuvenation, restoration and checkpointing
ISAS'08 Proceedings of the 5th international conference on Service availability
File fragmentation over an unreliable channel
INFOCOM'10 Proceedings of the 29th conference on Information communications
Journal of Systems and Software
A fault-tolerance architecture for Kepler-based distributed scientific workflows
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Checkpointing strategies for parallel jobs
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Performance implications of failures in large-scale cluster scheduling
JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Checkpoint scheduling model for optimality
Information Processing Letters
On the checkpointing strategy in desktop grids
IDCS'12 Proceedings of the 5th international conference on Internet and Distributed Computing Systems
International Journal of Security and Networks
ACR: automatic checkpoint/restart for soft and hard error protection
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
The Journal of Supercomputing
Checkpointing is an effective fault-tolerance technique for improving system availability and reliability. However, blind checkpoint placement can result in either performance degradation or expensive recovery cost. By means of the calculus of variations, we derive an explicit formula that links the optimal checkpointing frequency with a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery. The theoretical result shows that the optimal checkpointing frequency is proportional to the square root of the failure rate and can be uniquely determined by the failure rate (time-varying or constant) if the recovery function is strictly increasing and the failure rate satisfies $\lambda(\infty) > 0$. Bruno and Coffman [2] suggest that optimal checkpointing is by its nature a function of the system failure rate, i.e., a time-varying failure rate demands time-varying checkpointing in order to meet a given optimality criterion. The results obtained in this paper agree with their viewpoint.
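The square-root dependence on the failure rate can be illustrated with a small sketch. This is not the variational formula derived in the paper. It uses the classical first-order approximation for a constant failure rate, $T^* = \sqrt{2C/\lambda}$, where the checkpoint cost `checkpoint_cost` ($C$) and all function names are assumptions introduced here for illustration. The approximation further assumes that the checkpoint cost is small compared with the mean time between failures and that, on average, half an interval of work is lost per failure.

```python
import math


def first_order_interval(checkpoint_cost, failure_rate):
    """First-order optimum checkpoint interval T* = sqrt(2C / lambda).

    Assumes a constant failure rate and a checkpoint cost that is small
    compared with the mean time between failures (illustrative only).
    """
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("checkpoint_cost and failure_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def checkpoint_frequency(checkpoint_cost, failure_rate):
    """Checkpoints per unit time, i.e. 1 / T* = sqrt(lambda / (2C))."""
    return 1.0 / first_order_interval(checkpoint_cost, failure_rate)


def overhead_rate(interval, checkpoint_cost, failure_rate):
    """Approximate cost per unit time for a given checkpoint interval.

    First term: time spent writing checkpoints.
    Second term: expected rework, assuming half an interval is lost per failure.
    """
    return checkpoint_cost / interval + failure_rate * interval / 2.0


if __name__ == "__main__":
    cost = 0.05  # hypothetical checkpoint cost, in hours
    for rate in (0.01, 0.04, 0.16):  # hypothetical failures per hour
        freq = checkpoint_frequency(cost, rate)
        print(f"failure rate {rate:.2f}/h -> {freq:.3f} checkpoints/h")
```

Under these assumptions, quadrupling the failure rate doubles the checkpoint frequency, which is the square-root scaling stated in the abstract. A time-varying failure rate would need the paper's general result rather than this constant-rate shortcut.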