Optimistic Crash Recovery without Changing Application Messages

Authors:
S. Venkatesan;Tony Tong-Ying Juang;Sridhar Alagar
Affiliations:
Univ. of Texas at Dallas, Richardson;Chung-Hua Polytechnic Institute, Hsin Chu, Taiwan;Univ. of Texas at Dallas, Richardson
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1997

Abstract

We present an optimistic crash recovery technique that incurs no communication overhead during normal operation of the distributed system. The technique appends no information to application messages, does not suffer from the domino effect, and rolls each processor back at most once during recovery. We present three distributed rollback algorithms, together with their complexities and correctness proofs, and measure their performance through extensive simulations.