Fused state machines for fault tolerance in distributed systems

  • Authors: Bharath Balasubramanian; Vijay K. Garg
  • Affiliations: Parallel and Distributed Systems Laboratory, Dept. of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX (both authors)
  • Venue: OPODIS'11: Proceedings of the 15th International Conference on Principles of Distributed Systems
  • Year: 2011

Abstract

Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash faults among n machines, replication requires nf additional backup machines. We present a fusion-based solution that requires just f additional backup machines (called fusions or fused backups). In this paper, we first propose a fundamental problem regarding DFSMs, independent of fault tolerance, that has not been explored in the literature so far: Given a machine M, with a set of states and a set of events, can we replace it with machines each containing fewer events than M? To formalize this, we define a (k,e)-event decomposition of a given machine M, which is a set of k machines, each with at least e fewer events than the event set of M, that, acting in parallel, are equivalent to M. We present an algorithm to generate such machines with time complexity O(|X_M|^3 |Σ_M|^e), where X_M is the set of states and Σ_M the set of events of M. Second, we use our event decomposition algorithm to generate fused backups that can correct faults among a given set of machines. We show that these backups are minimal w.r.t. the number of states they contain and the number of events in their event set. Third, we use the notion of locality sensitive hashing to present algorithms for the detection and correction of faults for the fusion-based solution. The algorithm for the detection of Byzantine faults has time complexity O(nf) on average, which is the same as that for replication. The algorithm for the correction of both crash and Byzantine faults has time complexity O(nρf) with high probability (w.h.p.), where ρ is the average state reduction achieved by fusion. We show that for small values of n (for most practical systems, n