This paper proposes a novel algorithm for taking checkpoints and rolling back processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) multiple processes can initiate checkpointing simultaneously; (2) no additional messages are transmitted to take checkpoints; (3) the set of local checkpoints taken by the processes denotes a consistent global state; (4) multiple processes can initiate rollback recovery simultaneously; (5) only the minimum number of processes is rolled back; and (6) each process is rolled back asynchronously. Rolling back the processes requires O(l) messages, where l is the number of channels. The algorithm therefore keeps the system highly available.
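Property (3) can be illustrated with a small sketch. This is not the paper's algorithm. It only checks the standard consistency condition that no checkpoint records a message as received unless the sender's checkpoint records it as sent (no orphan messages). The checkpoint layout (per-peer `sent` and `recv` counters) and the function names `orphan_channels` and `is_consistent` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical checkpoint model (an assumption, not taken from the paper):
# each process stores, per peer, how many messages it had sent to that peer
# ("sent") and received from that peer ("recv") when the checkpoint was taken.
Checkpoint = Dict[str, Dict[str, int]]


def orphan_channels(checkpoints: Dict[str, Checkpoint]) -> List[Tuple[str, str]]:
    """Return the channels (sender, receiver) that carry an orphan message.

    A channel is orphaned when the receiver's checkpoint records more
    receipts than the sender's checkpoint records sends.
    """
    orphans = []
    for receiver, cp in checkpoints.items():
        for sender, received in cp.get("recv", {}).items():
            # A sender with no checkpoint is treated as having sent nothing.
            sender_cp = checkpoints.get(sender, {})
            sent = sender_cp.get("sent", {}).get(receiver, 0)
            if received > sent:
                orphans.append((sender, receiver))
    return orphans


def is_consistent(checkpoints: Dict[str, Checkpoint]) -> bool:
    """A set of local checkpoints is consistent if no channel is orphaned."""
    return not orphan_channels(checkpoints)


if __name__ == "__main__":
    # p sent 2 messages to q, and q had received only 1 of them: consistent
    # (the other message is simply still in transit).
    good = {
        "p": {"sent": {"q": 2}, "recv": {"q": 0}},
        "q": {"sent": {"p": 0}, "recv": {"p": 1}},
    }
    # q recorded 2 receipts from p, but p recorded only 1 send: inconsistent.
    bad = {
        "p": {"sent": {"q": 1}, "recv": {}},
        "q": {"sent": {}, "recv": {"p": 2}},
    }
    print(is_consistent(good), orphan_channels(bad))
```

In this model, a process whose checkpoint sits at the receiving end of an orphaned channel would have to roll back further, which is the kind of dependency a rollback procedure must follow along the channels.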