Progressive Retry for Software Failure Recovery in Message-Passing Applications

Authors:
Yi-Min Wang; Yennun Huang; W. Kent Fuchs; Chandra Kintala
Venue:
IEEE Transactions on Computers
Year:
1997

Abstract

A method of execution retry for bypassing software faults in message-passing applications is described in this paper. Based on the techniques of checkpointing and message logging, we demonstrate the use of message replaying and message reordering as two mechanisms for achieving localized and fast recovery. The approach gradually increases the rollback distance and the number of affected processes when a previous retry fails, and is therefore named progressive retry. Examples from telecommunications software systems and performance measurements from an application-level implementation are described to illustrate the benefits of the scheme.
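
The escalation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `RetryStep`, `escalation_ladder`, and `progressive_retry` are hypothetical, the rule that scope grows by one process and one checkpoint per level is an assumption made for illustration, and the `attempt` callback stands in for the real rollback and re-execution, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class RetryStep:
    """One retry attempt (hypothetical structure, not from the paper)."""
    processes: int  # how many processes are rolled back in this step
    distance: int   # how many checkpoints back each of them rolls
    reorder: bool   # False: replay logged messages in their original order
                    # True: deliver the logged messages in a different order


def escalation_ladder(max_processes: int, max_distance: int) -> List[RetryStep]:
    """Build retry steps whose scope never shrinks from one step to the next.

    Assumption: scope grows by one process and one checkpoint per level,
    capped at the given maxima.
    """
    steps: List[RetryStep] = []
    for level in range(1, max(max_processes, max_distance) + 1):
        procs = min(level, max_processes)
        dist = min(level, max_distance)
        # At each scope, try message replay first, then message reordering
        # (this ordering is also an assumption of the sketch).
        for reorder in (False, True):
            steps.append(RetryStep(procs, dist, reorder))
    return steps


def progressive_retry(
    steps: List[RetryStep],
    attempt: Callable[[RetryStep], bool],
) -> Optional[RetryStep]:
    """Run the steps in order and return the first one that recovers.

    `attempt` is a caller-supplied stand-in for rolling back to a checkpoint
    and re-executing; it returns True when the retry bypasses the fault.
    Returns None if every step fails.
    """
    for step in steps:
        if attempt(step):
            return step
    return None


# Usage: a fake fault that is bypassed only when at least two processes
# are rolled back and their messages are reordered.
tried: List[RetryStep] = []


def fake_attempt(step: RetryStep) -> bool:
    tried.append(step)
    return step.reorder and step.processes >= 2


recovered = progressive_retry(escalation_ladder(3, 2), fake_attempt)
```

In this sketch the cheapest recovery is always tried first, so a fault that a single-process replay can bypass never triggers a wider rollback; wider and deeper steps run only after the narrower ones have failed.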