Design and Analysis of an Integrated Checkpointing and Recovery Scheme for Distributed Applications

Authors:
Bina Ramamurthy; Shambhu Upadhyaya; Bharat Bhargava
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2000

Abstract

This paper presents an integrated checkpointing and recovery scheme that exploits the low latency and high coverage of a concurrent error detection scheme. Message dependency, the main source of multistep rollback in distributed systems, is minimized by a new message validation technique derived from the notion of concurrent error detection. A new global state matrix is introduced to track error checking and message dependency in a distributed system and to assist in recovery. The analytical model, algorithms, and data structures that support an easy implementation of the scheme are presented, and the completeness and correctness of the algorithms are proved. A number of scenarios and illustrations detail the analytical model. The benefits of the integrated checkpointing scheme are quantified by simulation using an object-oriented test framework.
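
The abstract does not define the global state matrix, so the following is only a rough illustration of how a matrix might track message dependency and validation together. It is not the authors' data structure. Everything in it is an assumption made for this sketch: the class name `GlobalStateMatrix`, its methods, the meaning of each entry (a count of unvalidated messages), and the rule that a sender's error check validates its outgoing messages only when the sender itself holds no unvalidated input.

```python
class GlobalStateMatrix:
    """Hypothetical n x n matrix (assumption, not taken from the paper).

    Entry pending[i][j] counts messages that process i has received from
    process j and that have not yet been validated by an error check.
    """

    def __init__(self, n):
        self.n = n
        self.pending = [[0] * n for _ in range(n)]

    def record_receive(self, receiver, sender):
        # A new, still unvalidated dependency of `receiver` on `sender`.
        self.pending[receiver][sender] += 1

    def validate_sender(self, sender):
        """Called when `sender` passes an error check.

        Assumed rule for this sketch: the check validates the sender's
        outgoing messages only if the sender has no unvalidated input.
        Returns True if the messages were validated.
        """
        if any(self.pending[sender]):
            return False
        for receiver in range(self.n):
            self.pending[receiver][sender] = 0
        return True

    def rollback_set(self, faulty):
        # Processes reachable from `faulty` through unvalidated messages;
        # these are the ones that would have to roll back with it.
        affected = {faulty}
        frontier = [faulty]
        while frontier:
            src = frontier.pop()
            for receiver in range(self.n):
                if self.pending[receiver][src] > 0 and receiver not in affected:
                    affected.add(receiver)
                    frontier.append(receiver)
        return affected


gsm = GlobalStateMatrix(3)
gsm.record_receive(1, 0)  # P1 receives a message from P0
gsm.record_receive(2, 1)  # P2 receives a message from P1
before = gsm.rollback_set(0)   # all three processes depend on P0
gsm.validate_sender(0)         # P0 passes its error check
after = gsm.rollback_set(0)    # only P0 itself would roll back
```

In this toy model, frequent low-latency error checks keep the matrix sparse, so a failure touches few processes. That mirrors, in a simplified way, the abstract's claim that message validation reduces multistep rollback.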