Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs, or simply machines). To correct f crash faults among n machines, replication requires nf additional backup machines. We present a fusion-based solution that requires just f additional backup machines (called fusions or fused backups). In this paper, we first propose a fundamental problem regarding DFSMs, independent of fault tolerance, that has not been explored in the literature so far: given a machine M with a set of states and a set of events, can we replace it with machines that each contain fewer events than M? To formalize this, we define a (k,e)-event decomposition of a given machine M: a set of k machines, each with at least e fewer events than the event set of M, that, acting in parallel, are equivalent to M. We present an algorithm to generate such machines with time complexity O(|X_M|^3 |Σ_M|^e), where X_M is the set of states and Σ_M the set of events of M. Second, we use our event decomposition algorithm to generate fused backups that can correct faults among a given set of machines. We show that these backups are minimal with respect to the number of states they contain and the number of events in their event set. Third, we use the notion of locality-sensitive hashing to present algorithms for the detection and correction of faults in the fusion-based solution. The algorithm for the detection of Byzantine faults has time complexity O(nf) on average, which is the same as that for replication. The algorithm for the correction of both crash and Byzantine faults has time complexity O(nρf) with high probability (w.h.p.), where ρ is the average state reduction achieved by fusion. We show that for small values of n (for most practical systems, n
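The difference between nf replicated backups and f fused backups can be made concrete with a toy example. The sketch below is an illustration only, not the paper's generation algorithm: it assumes two hypothetical primaries that count events "a" and "b" modulo 3, and a hand-picked fused backup that counts both events modulo 3. All names (`Counter`, `recover`, `replication_backups`, `fusion_backups`) are invented for this example.

```python
from dataclasses import dataclass

MOD = 3  # each toy machine has states 0, 1, 2


@dataclass
class Counter:
    """A tiny DFSM: advances one state (mod MOD) on each event it listens to."""
    events: frozenset
    state: int = 0

    def apply(self, event: str) -> None:
        if event in self.events:
            self.state = (self.state + 1) % MOD


def run(machines, stream):
    """Feed every event in the stream to every machine, in order."""
    for event in stream:
        for machine in machines:
            machine.apply(event)


def recover(fused_state: int, surviving_state: int) -> int:
    """Recover a crashed primary's state from the fused backup and the survivor.

    Works for this toy pair only because fused = (a + b) mod MOD.
    """
    return (fused_state - surviving_state) % MOD


def replication_backups(n: int, f: int) -> int:
    """Backups needed by replication to correct f crash faults among n machines."""
    return n * f


def fusion_backups(f: int) -> int:
    """Backups needed by the fusion-based approach, per the abstract."""
    return f


a = Counter(frozenset({"a"}))            # primary A counts "a" events
b = Counter(frozenset({"b"}))            # primary B counts "b" events
fused = Counter(frozenset({"a", "b"}))   # assumed fused backup counts both

stream = ["a", "b", "a", "a", "b"]
run([a, b, fused], stream)

lost_state = a.state                     # pretend A crashes here
recovered_state = recover(fused.state, b.state)
```

With n = 2 and f = 1, replication needs two extra machines while this sketch needs one three-state backup. The fused machine here was chosen by hand; finding minimal fusions for arbitrary machines is what the event decomposition algorithm described in the abstract addresses.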