A Variational Calculus Approach to Optimal Checkpoint Placement

Authors:
Yibei Ling; Jie Mi; Xiaola Lin
Venue:
IEEE Transactions on Computers
Year:
2001

Citing 14
Cited 25

On the optimum checkpoint selection problem

SIAM Journal on Computing
Calculating Cumulative Operational Time Distributions of Repairable Computer Systems

IEEE Transactions on Computers - The MIT Press scientific computation series
Optimal checkpointing of real-time tasks

IEEE Transactions on Computers
Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems

IEEE Transactions on Computers - Fault-Tolerant Computing
Optimum checkpoints with age dependent failures

Acta Informatica
Comparative Analysis of Different Models of Checkpointing and Recovery

IEEE Transactions on Software Engineering
On the Optimal Checkpointing of Critical Tasks and Transaction-Oriented Systems

IEEE Transactions on Software Engineering
A stochastic checkpoint optimization problem

SIAM Journal on Computing
Diskless Checkpointing

IEEE Transactions on Parallel and Distributed Systems
Processor Shadowing: Maximizing Expected Throughput in Fault-Tolerant Systems

Mathematics of Operations Research
Performance analysis of checkpointing strategies

ACM Transactions on Computer Systems (TOCS)
Optimization criteria for checkpoint placement

Communications of the ACM
Optimization criteria for checkpoint placement

Communications of the ACM
A first order approximation to the optimum checkpoint interval

Communications of the ACM

MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cluster Computing
Stochastic analysis of distributed deadlock scheduling

Proceedings of the twenty-fourth annual ACM symposium on Principles of distributed computing
A higher order estimate of the optimum checkpoint interval for restart dumps

Future Generation Computer Systems
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle

IEEE Transactions on Dependable and Secure Computing
On Optimal Deadlock Detection Scheduling

IEEE Transactions on Computers
Failure-aware checkpointing in fine-grained cycle sharing systems

Proceedings of the 16th international symposium on High performance distributed computing
Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems

International Journal of Parallel, Emergent and Distributed Systems
Numerical computation algorithms for sequential checkpoint placement

Performance Evaluation
Towards resilient high performance applications through real time reliability metric generation and autonomous failure correction

Proceedings of the 2009 workshop on Resiliency in high performance
Optimal checkpointing interval for two-level recovery schemes

Computers & Mathematics with Applications
A higher order estimate of the optimum checkpoint interval for restart dumps

Future Generation Computer Systems
A model for predicting the optimum checkpoint interval for restart dumps

ICCS'03 Proceedings of the 2003 international conference on Computational science
On checkpoint overhead in distributed systems providing session guarantees

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Analysis of a software system with rejuvenation, restoration and checkpointing

ISAS'08 Proceedings of the 5th international conference on Service availability
File fragmentation over an unreliable channel

INFOCOM'10 Proceedings of the 29th conference on Information communications
Comprehensive evaluation of aperiodic checkpointing and rejuvenation schemes in operational software system

Journal of Systems and Software
A fault-tolerance architecture for Kepler-based distributed scientific workflows

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Checkpointing strategies for parallel jobs

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Performance implications of failures in large-scale cluster scheduling

JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Checkpoint scheduling model for optimality

Information Processing Letters
On the checkpointing strategy in desktop grids

IDCS'12 Proceedings of the 5th international conference on Internet and Distributed Computing Systems
Modelling and evaluating a high serviceability fault tolerance strategy in cloud computing environments

International Journal of Security and Networks
ACR: automatic checkpoint/restart for soft and hard error protection

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

The Journal of Supercomputing

Abstract

Checkpointing is an effective fault-tolerance technique for improving system availability and reliability. However, blind checkpoint placement can result in either performance degradation or expensive recovery cost. By means of the calculus of variations, we derive an explicit formula that links the optimal checkpointing frequency to a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery. The theoretical result shows that the optimal checkpointing frequency is proportional to the square root of the failure rate and is uniquely determined by the failure rate (time-varying or constant) if the recovery function is strictly increasing and the failure rate satisfies $\lambda(\infty) > 0$. Bruno and Coffman [2] suggest that optimal checkpointing is by its nature a function of the system failure rate; that is, a time-varying failure rate demands time-varying checkpointing in order to meet a given optimality criterion. The results obtained in this paper agree with their viewpoint.
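
The square-root relationship stated in the abstract can be illustrated with the classical first-order approximation for a constant failure rate, in which the checkpoint interval is sqrt(2C/λ). The sketch below is not the formula derived in this paper: it assumes a constant failure rate, a fixed per-checkpoint cost, and an average loss of half an interval per failure. The function and parameter names are hypothetical.

```python
import math


def overhead_rate(interval, checkpoint_cost, failure_rate):
    """First-order expected overhead per unit time (illustrative assumption).

    checkpoint_cost / interval    : time spent writing checkpoints
    failure_rate * interval / 2   : expected rework, assuming a failure
                                    loses half an interval on average
    """
    return checkpoint_cost / interval + failure_rate * interval / 2.0


def first_order_interval(checkpoint_cost, failure_rate):
    """Interval minimizing overhead_rate: sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def checkpoint_frequency(checkpoint_cost, failure_rate):
    """Checkpoints per unit time; grows with sqrt(failure_rate)."""
    return 1.0 / first_order_interval(checkpoint_cost, failure_rate)


if __name__ == "__main__":
    cost = 2.0  # hypothetical checkpoint cost, in time units
    for lam in (0.01, 0.04):  # hypothetical constant failure rates
        interval = first_order_interval(cost, lam)
        freq = checkpoint_frequency(cost, lam)
        print(f"lambda={lam}: interval={interval:.2f}, frequency={freq:.4f}")
```

Under these assumptions, quadrupling the failure rate doubles the checkpointing frequency. The time-varying case treated in the paper needs the variational formulation, not this constant-rate shortcut.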