Fused state machines for fault tolerance in distributed systems

Authors:
Bharath Balasubramanian;Vijay K. Garg
Affiliations:
Parallel and Distributed Systems Laboratory, Dept. of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX;Parallel and Distributed Systems Laboratory, Dept. of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX
Venue:
OPODIS'11 Proceedings of the 15th international conference on Principles of Distributed Systems
Year:
2011



Abstract

Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash faults among n machines, replication requires nf additional backup machines. We present a fusion-based solution that requires just f additional backup machines (called fusions or fused backups). In this paper, we first propose a fundamental problem regarding DFSMs, independent of fault tolerance, that has not been explored in the literature so far: given a machine M with a set of states and a set of events, can we replace it with machines that each contain fewer events than M? To formalize this, we define a (k,e)-event decomposition of a given machine M: a set of k machines, each with at least e fewer events than the event set of M, that, acting in parallel, are equivalent to M. We present an algorithm to generate such machines with time complexity O(|X_M|^3 |Σ_M|^e), where X_M is the set of states and Σ_M the set of events of M. Second, we use our event decomposition algorithm to generate fused backups that can correct faults among a given set of machines. We show that these backups are minimal with respect to the number of states they contain and the number of events in their event set. Third, we use the notion of locality-sensitive hashing to present algorithms for the detection and correction of faults for the fusion-based solution. The algorithm for the detection of Byzantine faults has time complexity O(nf) on average, which is the same as that for replication. The algorithm for the correction of both crash and Byzantine faults has time complexity O(nρf) with high probability (w.h.p.), where ρ is the average state reduction achieved by fusion. We show that for small values of n (for most practical systems, n
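
To make the backup counts concrete, the following is a minimal illustrative sketch, not the paper's event decomposition algorithm. It assumes two hypothetical primaries, A and B, that count events 0 and 1 modulo 3, and a hand-built fused backup F that tracks the sum of both counters modulo 3. All function names here are assumptions introduced for illustration. With n = 2 and f = 1, replication would need nf = 2 backups, while this fusion needs f = 1.

```python
MOD = 3  # each machine in this sketch has three states: 0, 1, 2


def step_a(state, event):
    """Hypothetical primary A: counts occurrences of event 0, modulo 3."""
    return (state + 1) % MOD if event == 0 else state


def step_b(state, event):
    """Hypothetical primary B: counts occurrences of event 1, modulo 3."""
    return (state + 1) % MOD if event == 1 else state


def step_fused(state, event):
    """Hand-built fused backup F: tracks (a + b) mod 3, so it advances on both events."""
    return (state + 1) % MOD if event in (0, 1) else state


def run(step, events, start=0):
    """Apply a transition function to a sequence of events and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state


def recover_a(b_state, f_state):
    """Recover A's state after a crash of A, using B and the fused backup."""
    return (f_state - b_state) % MOD


def recover_b(a_state, f_state):
    """Recover B's state after a crash of B, using A and the fused backup."""
    return (f_state - a_state) % MOD


def backups_needed(n, f):
    """Backup machine counts stated in the abstract: nf for replication, f for fusion."""
    return {"replication": n * f, "fusion": f}
```

For example, after the event sequence 0, 1, 1, 0, 0 the states are A = 0, B = 2, and F = 2, so `recover_a(2, 2)` returns 0 and `recover_b(0, 2)` returns 2. The fused backup in this sketch has only three states, whereas a backup built from the cross product of A and B would have nine.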