Computer systems with an increasing failure rate (IFR) are common in practice. For such systems, the literature indicates that aperiodic checkpointing can outperform periodic checkpointing because it adapts to the failure process. For long-running applications, however, aperiodic checkpointing incurs substantial operational overhead, because checkpoints become more frequent as the application proceeds. To address this problem, we propose combining just-in-time process migration with aperiodic checkpointing for applications running on an IFR system, with the goal of reducing application execution time in the presence of failures. We present an analytical study of this migration-enhanced fault tolerance scheme, denoted migCP: we derive the application completion time under migCP and determine the optimal migration locations. Analytical modelling and empirical studies show that migCP outperforms aperiodic checkpointing under a variety of system parameters.
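To illustrate why aperiodic checkpointing becomes costly over time in an IFR system, here is a minimal Python sketch. It is not the migCP model from the paper. It assumes a Weibull failure process (shape greater than 1 gives an increasing hazard) and substitutes a simple heuristic: the well-known first-order interval sqrt(2C / lambda), re-evaluated with the current hazard rate at every checkpoint. The function names, the `t_min` guard and all parameter values are hypothetical choices made only for this example.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate at time t; increasing in t when shape > 1 (assumed failure model)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def aperiodic_intervals(horizon, ckpt_cost, shape, scale, t_min=50.0):
    """Return checkpoint intervals covering [0, horizon].

    Heuristic (an assumption, not the paper's derivation): each interval is
    sqrt(2 * ckpt_cost / hazard), with the hazard taken at the current time.
    t_min avoids the zero hazard at t = 0 when shape > 1.
    """
    intervals = []
    t = 0.0
    while True:
        remaining = horizon - t
        if remaining <= 0:
            break
        rate = weibull_hazard(max(t, t_min), shape, scale)
        interval = math.sqrt(2.0 * ckpt_cost / rate)
        if interval >= remaining:
            # Final, truncated interval up to the end of the run.
            intervals.append(remaining)
            break
        intervals.append(interval)
        t += interval
    return intervals


if __name__ == "__main__":
    ifr = aperiodic_intervals(horizon=2000.0, ckpt_cost=5.0, shape=2.0, scale=1000.0)
    flat = aperiodic_intervals(horizon=2000.0, ckpt_cost=5.0, shape=1.0, scale=1000.0)
    print("IFR (shape=2):", len(ifr), "intervals, first few:", [round(x, 1) for x in ifr[:4]])
    print("Constant rate (shape=1):", len(flat), "intervals, first few:", [round(x, 1) for x in flat[:4]])
```

Under these assumptions, the intervals shrink as the hazard rate grows, so a long run accumulates progressively more checkpoints per unit of work. This growing overhead is what the migration step in migCP is meant to reduce.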