An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging

Authors:
Kuo-Feng Ssu;Bin Yao;W. Kent Fuchs
Venue:
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Year:
1999

Citing: 26 (works cited by this paper, listed first below)
Cited: 2 (works that cite this paper, listed last below)

On the optimum checkpoint selection problem

SIAM Journal on Computing
Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems

IEEE Transactions on Computers - Fault-Tolerant Computing
Comparative Analysis of Different Models of Checkpointing and Recovery

IEEE Transactions on Software Engineering
Compiler-assisted full checkpointing

Software—Practice & Experience
Necessary and Sufficient Conditions for Consistent Global Snapshots

IEEE Transactions on Parallel and Distributed Systems
An On-Line Algorithm for Checkpoint Placement

IEEE Transactions on Computers
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme

IEEE Transactions on Computers
On the Optimum Checkpoint Interval

Journal of the ACM (JACM)
A first order approximation to the optimum checkpoint interval

Communications of the ACM
Message Logging: Pessimistic, Optimistic, Causal, and Optimal

IEEE Transactions on Software Engineering
Portable Checkpointing for Heterogeneous Architectures

FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
PREACHES - Portable Recovery and Checkpointing in Heterogeneous Systems

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Message Logging in Mobile Computing

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
An Analysis of Communication-Induced Checkpointing

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Performance Analysis of Two Time-Based Coordinated Checkpointing Protocols

PRFTS '97 Proceedings of the 1997 Pacific Rim International Symposium on Fault-Tolerant Systems
On Patterns for Practical Fault Tolerant Software in Java

SRDS '98 Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
System-Level Versus User-Defined Checkpointing

SRDS '98 Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
Logging and Recovery in Adaptive Software Distributed Shared Memory Systems

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Low-Cost Checkpointing with Mutable Checkpoints in Mobile Computing Systems

ICDCS '98 Proceedings of the 18th International Conference on Distributed Computing Systems
Using Time to Improve the Performance of Coordinated Checkpointing

IPDS '96 Proceedings of the 2nd International Computer Performance and Dependability Symposium (IPDS '96)
Checkpointing and Its Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Compiler-Assisted Checkpointing

Compiler-Assisted Checkpointing
Libckpt: transparent checkpointing under Unix

TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings

Online Non-stop Software Update Using Replicated Execution Blocks

COMPSAC '00 24th International Computer Software and Applications Conference
Recovery Support for Internet-Based Real-Time Collaborative Editing Systems

ICCNMC '01 Proceedings of the 2001 International Conference on Computer Networks and Mobile Computing (ICCNMC'01)

Abstract

Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures. These solutions are often not applicable due to the lack of accurate data on the probability distribution of failures. Most current checkpoint libraries require application users to define a fixed time interval for checkpointing. The checkpoint interval usually implies the approximate maximum recovery time for single-process applications. However, actual recovery time can be much smaller when message logging is used. Due to this faster recovery, checkpointing may be more frequent than needed and thus unnecessary execution overhead is introduced. In this paper, an adaptive checkpointing protocol is developed to accurately enforce the user-defined recovery time and to reduce excessive checkpoints. The adaptive protocol has been implemented and evaluated using a receiver-based message logging algorithm on wired and wireless mobile networks. The results show that the protocol precisely maintains the user-defined maximum recovery times for several traces with varying message exchange rates. The mechanism incurs low overhead, avoids unnecessary checkpointing, and reduces failure-free execution time.
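
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' protocol, whose details the abstract does not give. The sketch assumes that recovery consists of reloading the last checkpoint and replaying logged messages, so that idle time spent waiting for messages does not have to be repeated. The class `AdaptiveCheckpointer`, its fields, the fixed `restore_cost`, and the helper `fixed_interval_checkpoints` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveCheckpointer:
    """Hypothetical sketch: checkpoint only when the estimated recovery
    time would otherwise exceed a user-defined bound."""
    max_recovery_time: float   # user-defined bound, in seconds
    restore_cost: float = 0.0  # assumed fixed cost of reloading a checkpoint
    replay_time: float = 0.0   # estimated time to replay the log since the last checkpoint
    checkpoints_taken: int = 0

    def estimated_recovery_time(self) -> float:
        # Assumption: recovery = reload checkpoint + replay logged messages.
        return self.restore_cost + self.replay_time

    def checkpoint(self) -> None:
        # A real system would write process state to stable storage here.
        self.checkpoints_taken += 1
        self.replay_time = 0.0

    def on_message_processed(self, busy_time: float) -> bool:
        """Account for one logged message that took busy_time seconds to process.

        Returns True if a checkpoint was taken before the message was counted.
        """
        took_checkpoint = False
        if self.estimated_recovery_time() + busy_time > self.max_recovery_time:
            self.checkpoint()
            took_checkpoint = True
        self.replay_time += busy_time
        return took_checkpoint


def fixed_interval_checkpoints(elapsed: float, interval: float) -> int:
    """Number of checkpoints a fixed wall-clock interval would take."""
    return int(elapsed // interval)


if __name__ == "__main__":
    # Illustrative trace: 100 messages, each 0.75 s of waiting and 0.25 s of work.
    ckpt = AdaptiveCheckpointer(max_recovery_time=5.0, restore_cost=1.0)
    for _ in range(100):
        ckpt.on_message_processed(busy_time=0.25)
    print("adaptive checkpoints:", ckpt.checkpoints_taken)
    print("fixed-interval checkpoints:", fixed_interval_checkpoints(100.0, 5.0))
```

In this toy trace the process is idle most of the time, so a checkpoint every 5 seconds of wall-clock time happens far more often than the estimated replay time requires. A message whose processing time alone exceeds the bound would still break the bound in this sketch; a real protocol would need to handle that case.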