Simplifying fault-tolerance: providing the abstraction of crash failures

Authors:
Rida A. Bazzi; Gil Neiger
Affiliations:
Arizona State Univ., Tempe, Arizona; Intel Corp., Hillsboro, Oregon
Venue:
Journal of the ACM (JACM)
Year:
2001

Citing 15
Cited 8

Distributed agreement in the presence of processor and communication faults

IEEE Transactions on Software Engineering
A communication-efficient canonical form for fault-tolerant distributed protocols

PODC '86 Proceedings of the fifth annual ACM symposium on Principles of distributed computing
Asynchronous byzantine agreement protocols

Information and Computation
Achieving consensus in fault-tolerant distributed computer systems: protocols, lower bounds, and simulations

Achieving consensus in fault-tolerant distributed computer systems: protocols, lower bounds, and simulations
A Compiler that Increases the Fault Tolerance of Asynchronous Protocols

IEEE Transactions on Computers
Knowledge and common knowledge in a distributed environment

Journal of the ACM (JACM)
Automatically increasing the fault-tolerance of distributed algorithms

Journal of Algorithms
Consensus in the presence of timing uncertainty: omission and Byzantine failures (extended abstract)

PODC '91 Proceedings of the tenth annual ACM symposium on Principles of distributed computing
The possibility and the complexity of achieving fault-tolerant coordination

PODC '92 Proceedings of the eleventh annual ACM symposium on Principles of distributed computing
Bounds on the time to reach agreement in the presence of timing uncertainty

Journal of the ACM (JACM)
Automatically increasing fault tolerance in distributed systems

Automatically increasing fault tolerance in distributed systems
Fully Polynomial Byzantine Agreement for Processors in Rounds

SIAM Journal on Computing
The Byzantine Generals Problem

ACM Transactions on Programming Languages and Systems (TOPLAS)
Issues of fault tolerance in concurrent computations (databases, reliability, transactions, agreement protocols, distributed computing)

Issues of fault tolerance in concurrent computations (databases, reliability, transactions, agreement protocols, distributed computing)
Common knowledge and consistent simultaneous coordination

Distributed Computing

Hundreds of impossibility results for distributed computing

Distributed Computing - Papers in celebration of the 20th anniversary of PODC
The perfectly synchronized round-based model of distributed computing

Information and Computation
Adaptive timeliness of consensus in presence of crash and timing faults

Journal of Parallel and Distributed Computing
PeerReview: practical accountability for distributed systems

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Nysiad: practical protocol transformation to tolerate Byzantine failures

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Narrowing power vs efficiency in synchronous set agreement: Relationship, algorithms and lower bound

Theoretical Computer Science
Making distributed applications robust

OPODIS'07 Proceedings of the 11th international conference on Principles of distributed systems
Byzantine renaming in synchronous systems with t

Proceedings of the 2013 ACM symposium on Principles of distributed computing

Abstract

The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate, especially for systems with synchronous message passing. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Such translations can be quantified by two measures: fault-tolerance, which is a measure of how many processors must remain correct for the translation to be correct, and round-complexity, which is a measure of how the translation increases the running time of an algorithm. Understanding these translations and their limitations with respect to these measures can provide insight into the relative impact of different models of faulty behavior on the ability to provide fault-tolerant applications for systems with synchronous message passing.

This paper considers translations from crash failures to each of the following types of more severe failures: omission to send messages; omission to send and receive messages; and totally arbitrary behavior. It shows that previously developed translations to send-omission failures are optimal with respect to both fault-tolerance and round-complexity. It exhibits a hierarchy of translations to general (send/receive) omission failures that improves upon the fault-tolerance of previously developed translations. These translations are optimal in that they cannot be improved with respect to one measure without negatively affecting the other; that is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. The paper also gives a hierarchy of translations to arbitrary failures that improves upon the round-complexity of previously developed translations. These translations are near-optimal;