An implementation and performance measurement of the progressive retry technique
IPDS '95 Proceedings of the International Computer Performance and Dependability Symposium
A method of execution retry for bypassing software faults in message-passing applications is described in this paper. Based on the techniques of checkpointing and message logging, we demonstrate the use of message replaying and message reordering as two mechanisms for achieving localized and fast recovery. The approach gradually increases the rollback distance and the number of affected processes when a previous retry fails, and is therefore named progressive retry. Examples from telecommunications software systems and performance measurements from an application-level implementation are described to illustrate the benefits of the scheme.
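The escalation idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `RetryStep` type, the `progressive_retry` function, the toy fault, and the use of list reversal as the "reordering" are all assumptions made for illustration. The sketch models only a single process and measures rollback distance as a count of logged messages; the growth in the number of affected processes is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class RetryStep:
    # Hypothetical description of one retry attempt (not from the paper).
    name: str
    rollback_distance: int  # number of most recent logged messages to undo
    reorder: bool           # False: replay in logged order, True: reorder


def progressive_retry(
    log: Sequence[str],
    steps: Sequence[RetryStep],
    attempt: Callable[[List[str]], bool],
) -> Optional[RetryStep]:
    """Run steps in order of increasing scope; return the first that succeeds."""
    for step in steps:
        distance = min(step.rollback_distance, len(log))
        cut = len(log) - distance
        kept, undone = list(log[:cut]), list(log[cut:])
        # Reversal is just one simple stand-in for "message reordering".
        redelivered = undone[::-1] if step.reorder else undone
        if attempt(kept + redelivered):
            return step
    return None


def toy_attempt(messages: List[str]) -> bool:
    """Toy fault: the run fails whenever "x" is delivered right before "y"."""
    return all(pair != ("x", "y") for pair in zip(messages, messages[1:]))


steps = [
    RetryStep("replay-1", 1, reorder=False),
    RetryStep("reorder-1", 1, reorder=True),
    RetryStep("replay-2", 2, reorder=False),
    RetryStep("reorder-2", 2, reorder=True),
]
result = progressive_retry(["a", "x", "y"], steps, toy_attempt)
print(result.name if result else "all retries failed")  # prints "reorder-2"
```

In this toy run the fault is deterministic, so plain replay never helps and only the wider reordering step bypasses it. In a real message-passing system, replay alone may succeed when the failure depends on timing or other nondeterministic conditions, which is why cheaper, more localized steps are worth trying first.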