Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval that minimizes the total execution time of an application in the presence of failures. These solutions are often not applicable in practice because accurate data on the probability distribution of failures is lacking. Most current checkpoint libraries require application users to define a fixed time interval for checkpointing. For single-process applications, the checkpoint interval usually implies the approximate maximum recovery time. However, actual recovery time can be much smaller when message logging is used. Because of this faster recovery, checkpoints may be taken more frequently than needed, which introduces unnecessary execution overhead. In this paper, an adaptive checkpointing protocol is developed to accurately enforce the user-defined recovery time and to reduce excessive checkpoints. The protocol has been implemented and evaluated using a receiver-based message logging algorithm on wired and wireless mobile networks. The results show that the protocol precisely maintains the user-defined maximum recovery times for several traces with varying message exchange rates. The mechanism incurs low overhead, avoids unnecessary checkpointing, and reduces failure-free execution time.
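The idea of checkpointing against a recovery-time bound rather than a fixed interval can be illustrated with a minimal sketch. The abstract does not specify the protocol's internals, so everything here is an assumption made for illustration: the class name `AdaptiveCheckpointer`, the recovery-time model (checkpoint restore cost plus log replay time), and the `replay_speedup` parameter, which stands in for the observation that replaying logged messages can be faster than the original execution.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveCheckpointer:
    """Hypothetical sketch, not the paper's protocol: take a checkpoint only
    when the estimated recovery time would exceed the user-defined bound."""

    max_recovery_time: float      # user-defined bound, in seconds
    restore_cost: float           # assumed time to reload a checkpoint, in seconds
    replay_speedup: float = 2.0   # assumed ratio of original run time to replay time
    work_since_checkpoint: float = 0.0
    checkpoints_taken: int = 0

    def estimated_recovery_time(self) -> float:
        # Assumed model: reload the last checkpoint, then replay the logged work.
        return self.restore_cost + self.work_since_checkpoint / self.replay_speedup

    def checkpoint(self) -> None:
        # Taking a checkpoint resets the amount of work that must be replayed.
        self.work_since_checkpoint = 0.0
        self.checkpoints_taken += 1

    def record_work(self, seconds: float) -> bool:
        """Account for `seconds` of execution.

        Returns True if a checkpoint was taken before the work was recorded.
        """
        projected = (
            self.restore_cost
            + (self.work_since_checkpoint + seconds) / self.replay_speedup
        )
        took_checkpoint = False
        if projected > self.max_recovery_time and self.work_since_checkpoint > 0:
            self.checkpoint()
            took_checkpoint = True
        self.work_since_checkpoint += seconds
        return took_checkpoint


if __name__ == "__main__":
    ckpt = AdaptiveCheckpointer(max_recovery_time=10.0, restore_cost=2.0)
    for _ in range(48):
        ckpt.record_work(1.0)
    print(ckpt.checkpoints_taken, ckpt.estimated_recovery_time())
```

Under these assumed parameters, a fixed 10-second interval would checkpoint every 10 seconds of work, whereas the sketch waits until 16 seconds of work have accumulated, because replay is assumed to run twice as fast as the original execution. The estimated recovery time still stays within the 10-second bound.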