An integrated checkpointing and recovery scheme that exploits the low latency and high coverage of a concurrent error detection scheme is presented. Message dependency, the main source of multistep rollback in distributed systems, is minimized by a new message validation technique derived from the notion of concurrent error detection. A new global state matrix is introduced to track error checking and message dependency in a distributed system and to assist in recovery. The analytical model, algorithms, and data structures that support a straightforward implementation of the scheme are presented, and the completeness and correctness of the algorithms are proved. A number of scenarios and illustrations detail the analytical model. The benefits of the integrated checkpointing scheme are quantified by simulation using an object-oriented test framework.
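To make the idea of tracking message dependency concrete, the following is a minimal sketch, not the paper's actual global state matrix or algorithms, which the abstract does not specify. It assumes a hypothetical `DependencyMatrix` class in which entry `dep[r][s]` counts messages that process `r` received from process `s` and that have not yet been validated. It also assumes that a passed error check on a sender validates the messages it has sent so far. All names and the validation rule are illustrative assumptions.

```python
class DependencyMatrix:
    """Hypothetical tracker: dep[r][s] counts unvalidated messages
    that process r has received from process s (an assumption, not
    the paper's data structure)."""

    def __init__(self, n):
        self.n = n
        self.dep = [[0] * n for _ in range(n)]

    def record_message(self, sender, receiver):
        # The receiver now depends on the sender until the message is validated.
        self.dep[receiver][sender] += 1

    def validate_sender(self, sender):
        # Assumed rule: the sender passed an error check, so messages it
        # has sent so far no longer force receivers to roll back with it.
        for r in range(self.n):
            self.dep[r][sender] = 0

    def rollback_set(self, failed):
        # Processes that must roll back if `failed` detects an error:
        # everything reachable through unvalidated message dependencies.
        to_roll = {failed}
        stack = [failed]
        while stack:
            s = stack.pop()
            for r in range(self.n):
                if self.dep[r][s] > 0 and r not in to_roll:
                    to_roll.add(r)
                    stack.append(r)
        return to_roll


m = DependencyMatrix(3)
m.record_message(sender=0, receiver=1)
m.record_message(sender=1, receiver=2)
before = m.rollback_set(0)   # dependencies chain from 0 to 1 to 2
m.validate_sender(0)
after = m.rollback_set(0)    # validated messages no longer propagate rollback
```

In this sketch, validating a sender's messages early shrinks the set of processes dragged into a rollback, which is the intuition behind minimizing multistep rollback. The actual scheme's matrix contents and validation conditions may differ.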