Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Stabilizing Communication Protocols
IEEE Transactions on Computers - Special issue on protocol engineering
Synthesis of Communication Protocols: Survey and Assessment
IEEE Transactions on Computers - Special issue on protocol engineering
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Self-stabilizing systems in spite of distributed control
Communications of the ACM
A unified approach to fault-tolerance in communication protocols based on recovery procedures
IEEE/ACM Transactions on Networking (TON)
IEEE/ACM Transactions on Networking (TON)
Hi-index | 0.24 |
This paper addresses the problem of designing stabilizing computer communication protocols modelled by communicating finite state machines. A communication protocol is said to be stabilizing if, starting from or reaching any illegal state, the protocol will eventually reach a legal (or consistent) state, and resume its normal execution. To achieve stabilization, the protocol must be able to detect the error as soon as it occurs, and then it must recover from that error and revert back to a legal protocol state. The later issue related to recovery is tackled here, and an efficient procedure for the recovery in communications protocols is described. The recovery procedure does not require periodic checkpointing and, therefore, is less intrusive. It requires less time for rollback and fewer recovery control messages than other procedures. Only a minimal number of processes will roll back, and a minimal number of protocol messages will be retransmitted during recovery. Moreover, our procedure requires minimal stable storage to be used to record contextual information exchanged during the progress of the protocol. Finally, our procedure is compared with an existing recovery procedure, and an illustrative example is provided.