Checkpoint and Rollback in Asynchronous Distributed Systems

Authors:
H. Higaki; K. Shima; T. Tachikawa; M. Takizawa
Venue:
Proceedings of INFOCOM '97: Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Driving the Information Revolution
Year:
1997

Abstract

This paper proposes a novel algorithm for taking checkpoints and rolling back processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) multiple processes can simultaneously initiate checkpointing; (2) no additional messages are transmitted for taking checkpoints; (3) a set of local checkpoints taken by multiple processes denotes a consistent global state; (4) multiple processes can simultaneously initiate rollback recovery; (5) only the minimum number of processes is rolled back; and (6) each process is rolled back asynchronously. The number of messages needed to roll back the processes is O(l), where l is the number of channels. The algorithm therefore keeps the system highly available.
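
Property (3) depends on the standard notion of a consistent global state: no message may be recorded as received before the receiver's checkpoint unless its send is also recorded before the sender's checkpoint. The sketch below illustrates only that check and is not the paper's algorithm. The names `Message` and `is_consistent`, and the use of per-process event sequence numbers, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message; the sequence numbers are
    # local event indices at the sender and at the receiver.
    sender: str
    send_seq: int
    receiver: str
    recv_seq: int


def is_consistent(checkpoints: Dict[str, int], messages: List[Message]) -> bool:
    """Return True if the set of local checkpoints has no orphan message.

    checkpoints maps each process name to the local event index at which
    its checkpoint was taken (events up to and including it are recorded).
    """
    for m in messages:
        received_before_cut = m.recv_seq <= checkpoints[m.receiver]
        sent_before_cut = m.send_seq <= checkpoints[m.sender]
        # An orphan message is recorded as received but not as sent.
        if received_before_cut and not sent_before_cut:
            return False
    return True


# Usage: p1 sends at its event 2, p2 receives at its event 3.
msgs = [Message("p1", 2, "p2", 3)]
consistent_cut = is_consistent({"p1": 2, "p2": 3}, msgs)
orphan_cut = is_consistent({"p1": 1, "p2": 3}, msgs)
in_transit_cut = is_consistent({"p1": 2, "p2": 2}, msgs)
```

A message that is sent before the sender's checkpoint but received after the receiver's checkpoint is merely in transit, so this check still treats the state as consistent.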