Implementing fault-tolerant services using state machines: beyond replication

  • Authors:
  • Vijay K. Garg

  • Affiliations:
  • Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX

  • Venue:
  • DISC'10: Proceedings of the 24th International Conference on Distributed Computing
  • Year:
  • 2010

Abstract

This paper describes a method for implementing fault-tolerant services in distributed systems based on the idea of fused state machines. The theory of fused state machines combines coding theory with replication to achieve efficiency as well as savings in storage and messages during normal operation. Fused state machines may incur higher overhead during recovery from crash or Byzantine faults, but that may be acceptable if the probability of a fault is low. Given n different state machines, pure replication-based schemes require n(f + 1) replicas to tolerate f crash faults in the system and n(2f + 1) replicas to tolerate f Byzantine faults. For crash faults, we give an algorithm that requires the optimal number, f, of backup state machines to tolerate f faults in a system of n machines. For Byzantine faults, we propose an algorithm that requires only nf + f additional state machines, as opposed to the 2nf additional machines required by replication. Our algorithm combines ideas from coding theory with replication to keep overhead low during normal operation while keeping small the number of copies required to tolerate f faults.
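To make the crash-fault claim concrete, here is a minimal sketch of "f backups for n machines" in the simplest possible setting. It is not the paper's fusion algorithm. It assumes each of the n state machines is a counter that only receives "add v" events. It also swaps in a standard linear erasure code: each of the f backups stores a Cauchy-matrix-weighted sum of all counters modulo a prime P. Every square submatrix of a Cauchy matrix is invertible, so any k ≤ f crashed counters can be solved from any k surviving backups. The names `FusedCounters`, `cauchy`, `solve_mod` and the modulus `P` are hypothetical choices for this illustration. With n = 3 and f = 2, pure replication keeps nf = 6 extra copies (n(f + 1) replicas in total), whereas this sketch keeps 2.

```python
# Hypothetical sketch: f fused backups for n counter state machines.
# Each backup holds a fixed linear combination (mod P) of all counters, with
# Cauchy-matrix coefficients (a swapped-in MDS erasure code, not the paper's
# construction), so any k <= f lost counters can be rebuilt from k live backups.

P = 2**31 - 1  # prime modulus; counter values live in Z_P (assumption)


def cauchy(f, n):
    """f x n Cauchy matrix C[j][i] = 1 / (x_j - y_i) mod P, x_j = j, y_i = f + i."""
    return [[pow((j - f - i) % P, P - 2, P) for i in range(n)] for j in range(f)]


def solve_mod(A, b):
    """Solve the square system A x = b over Z_P by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        piv = next(r for r in range(col, k) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], P - 2, P)  # modular inverse via Fermat
        M[col] = [(v * inv) % P for v in M[col]]
        for r in range(k):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [(a - factor * c) % P for a, c in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]


class FusedCounters:
    def __init__(self, n, f):
        self.n, self.f = n, f
        self.C = cauchy(f, n)
        self.primary = [0] * n  # states of the n original machines
        self.backup = [0] * f   # states of the f fused backups

    def add(self, i, v):
        """Apply event 'add v' to machine i; each backup applies a scaled update."""
        self.primary[i] = (self.primary[i] + v) % P
        for j in range(self.f):
            self.backup[j] = (self.backup[j] + self.C[j][i] * v) % P

    def recover(self, lost, live_primaries, live_backups):
        """Rebuild crashed primaries using only states of machines that survived.

        lost: indices of crashed primaries.
        live_primaries / live_backups: dicts mapping index -> surviving state.
        """
        k = len(lost)
        rows = sorted(live_backups)[:k]
        if len(rows) < k:
            raise ValueError("more crashes than the surviving backups can repair")
        A = [[self.C[j][i] for i in lost] for j in rows]
        # Subtract the known contribution of surviving primaries from each backup.
        b = [(live_backups[j]
              - sum(self.C[j][i] * s for i, s in live_primaries.items())) % P
             for j in rows]
        return dict(zip(lost, solve_mod(A, b)))


if __name__ == "__main__":
    fc = FusedCounters(n=3, f=2)
    for i, v in [(0, 5), (1, 7), (2, 11), (0, 3)]:
        fc.add(i, v)
    # Two crashes: primary 0 and backup 0.
    print(fc.recover([0], {1: fc.primary[1], 2: fc.primary[2]},
                     {1: fc.backup[1]}))  # prints {0: 8}
    # Two crashes: primaries 0 and 2; both backups survive.
    print(fc.recover([0, 2], {1: fc.primary[1]},
                     {0: fc.backup[0], 1: fc.backup[1]}))  # prints {0: 8, 2: 11}
```

The linear code keeps normal-case cost low: each event touches its own counter plus one multiply-add per backup, and the expensive step (solving a small linear system) happens only during recovery. This mirrors the abstract's trade-off of cheap normal operation against costlier recovery.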