Implementing fault-tolerant services using state machines: beyond replication

Authors:
Vijay K. Garg
Affiliations:
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX
Venue:
DISC'10 Proceedings of the 24th international conference on Distributed computing
Year:
2010

Abstract

This paper describes a method to implement fault-tolerant services in distributed systems based on the idea of fused state machines. The theory of fused state machines combines coding theory with replication to keep normal operation efficient while saving storage and messages. Fused state machines may incur higher overhead during recovery from crash or Byzantine faults, but that may be acceptable if the probability of a fault is low. Assuming n different state machines, pure replication-based schemes require n(f + 1) replicas to tolerate f crash faults in a system and n(2f + 1) replicas to tolerate f Byzantine faults. For crash faults, we give an algorithm that requires only f backup state machines, which is optimal, to tolerate f faults in a system of n machines. For Byzantine faults, we propose an algorithm that requires only nf + f additional state machines, as opposed to the 2nf additional state machines that replication needs. Our algorithm combines ideas from coding theory with replication to provide low overhead during normal operation while keeping the number of copies required to tolerate f faults small.
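
The machine counts quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the sample values n = 10, f = 2 are assumptions made for this example, not taken from the paper. It counts only the additional (backup) machines, that is, the totals n(f + 1) and n(2f + 1) minus the n original machines.

```python
def replication_backups(n: int, f: int, byzantine: bool = False) -> int:
    """Backup machines needed by pure replication, beyond the n originals.

    Total machines are n(f + 1) for crash faults and n(2f + 1) for
    Byzantine faults, as stated in the abstract.
    """
    total = n * (2 * f + 1) if byzantine else n * (f + 1)
    return total - n


def fusion_backups(n: int, f: int, byzantine: bool = False) -> int:
    """Backup machines claimed for the fusion-based algorithms.

    f backups for crash faults, nf + f for Byzantine faults.
    """
    return n * f + f if byzantine else f


if __name__ == "__main__":
    n, f = 10, 2  # hypothetical system size and fault bound
    for byzantine in (False, True):
        kind = "Byzantine" if byzantine else "crash"
        print(
            f"{kind}: replication needs {replication_backups(n, f, byzantine)} "
            f"backups, fusion needs {fusion_backups(n, f, byzantine)}"
        )
```

With these sample values, replication needs 20 backups against 2 for fusion under crash faults, and 40 against 22 under Byzantine faults. The saving in the crash case grows with n, since the fusion count does not depend on the number of original machines.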