Efficient Rollback-Recovery Technique in Distributed Computing Systems

Authors:
Ge-Ming Chiu; Cheng-Ru Young
Affiliations:
-;-
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1996

Abstract

In this paper we propose a new approach to implementing rollback-recovery in a distributed computing system. The concept of a logical ring is introduced for maintaining the information required for consistent recovery from a system crash. The message processing order of a process is kept by all other processes on its logical ring. Transmission of data messages is accompanied by the circulation of the associated order messages on the ring. The order messages are small. In addition, redundant transmission of order information is avoided, thereby reducing the communication overhead incurred during failure-free operation. Furthermore, updating the order information and garbage collection are simplified in the proposed mechanism. Our approach does not require that information about message processing order be written to stable storage; in fact, the time-consuming operations of saving information to stable storage are confined to the checkpointing activities. When failures occur, a surviving process needs to roll back only if some preceding order information is totally lost, which is relatively unlikely considering the ever-growing speed of communication networks. It is shown that a system can recover correctly as long as at least one surviving process exists.
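
The idea of keeping a process's message processing order in the volatile memory of the other ring members can be illustrated with a toy model. The following is only a sketch under stated assumptions, not the authors' protocol: the names `Process`, `LogicalRing`, `deliver`, `crash`, and `recover_order` are hypothetical, a single ring containing every process is assumed, and checkpointing, garbage collection, and the avoidance of redundant order transmissions are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    alive: bool = True
    # Volatile copies of other processes' message processing orders,
    # keyed by the id of the process that owns the order (assumed layout).
    order_copies: dict = field(default_factory=dict)


class LogicalRing:
    """Toy model: one ring that contains every process (an assumption)."""

    def __init__(self, n):
        self.procs = [Process(pid) for pid in range(n)]

    def deliver(self, receiver, sender, seq):
        """Record that `receiver` processed message (sender, seq) and
        circulate a small order message to the other ring members."""
        entry = (sender, seq)
        n = len(self.procs)
        for step in range(1, n):
            holder = self.procs[(receiver + step) % n]
            if holder.alive:
                holder.order_copies.setdefault(receiver, []).append(entry)

    def crash(self, pid):
        """A crash loses everything held in volatile memory."""
        proc = self.procs[pid]
        proc.alive = False
        proc.order_copies.clear()

    def recover_order(self, pid):
        """Return the processing order of `pid` from any survivor,
        or None if every copy has been lost."""
        for proc in self.procs:
            if proc.alive and pid in proc.order_copies:
                return list(proc.order_copies[pid])
        return None


ring = LogicalRing(3)
ring.deliver(receiver=0, sender=1, seq=0)
ring.deliver(receiver=0, sender=2, seq=0)
ring.crash(0)
ring.crash(1)
order_with_one_survivor = ring.recover_order(0)
ring.crash(2)
order_with_no_survivor = ring.recover_order(0)
```

In this model the order information for process 0 remains available while any single process survives, and becomes unrecoverable only once every holder has crashed. That mirrors the abstract's claim at a very coarse level.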