Avoiding disruptive failovers in transaction processing systems with multiple active nodes

Authors:
Gong Su; Arun Iyengar
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

We present a highly available system for environments such as stock trading, where high request rates and low latency requirements mean that service disruptions on the order of seconds can be unacceptable. After a node failure, our system avoids the processing delays normally incurred in detecting the failure and transferring control to a backup node. We achieve this by using multiple primary nodes that process transactions concurrently as peers. If a primary node fails, the remaining primaries continue executing without being delayed at all by the failed primary. Nodes agree on a total order for processing requests using a novel low-overhead, wait-free algorithm that relies on a small amount of shared memory accessible to all nodes and a simple compare-and-swap-like protocol, which allows the system to progress at the speed of the fastest node. We have implemented our system on an IBM z990 zSeries eServer mainframe and show experimentally that it performs well and transparently handles node failures without delaying transaction processing. An efficient implementation of the transaction-ordering algorithm is critical to achieving good performance.
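The ordering idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' algorithm: it assumes every node receives the same set of requests (possibly in different arrival orders), that the shared memory is an array of initially empty slots, and that a slot is claimed with a compare-and-swap. The names `SharedSlots`, `run_node`, `simulate`, and `fail_after` are hypothetical, and the lock inside `SharedSlots` merely emulates an atomic hardware primitive that plain Python does not expose.

```python
import threading


class SharedSlots:
    """Hypothetical stand-in for the small shared memory region.

    The lock only emulates the atomicity of a hardware compare-and-swap.
    """

    def __init__(self, size):
        self._slots = [None] * size
        self._lock = threading.Lock()

    def compare_and_swap(self, index, expected, new):
        # Atomically write `new` only if the slot still holds `expected`.
        with self._lock:
            if self._slots[index] == expected:
                self._slots[index] = new
                return True
            return False

    def read(self, index):
        with self._lock:
            return self._slots[index]


def run_node(slots, arrivals, processed, fail_after=None):
    """Process requests in the globally agreed order.

    arrivals:   request ids in this node's local arrival order
    processed:  list that receives the ids in the order this node executes them
    fail_after: assumption for the demo, crash after this many requests
    """
    pending = list(arrivals)
    index = 0
    while pending:
        if fail_after is not None and len(processed) >= fail_after:
            return  # simulated crash; peers never wait for this node
        # Propose the oldest unordered local request for the next slot.
        slots.compare_and_swap(index, None, pending[0])
        # Whether or not the CAS succeeded, the slot is now decided.
        winner = slots.read(index)
        pending.remove(winner)
        processed.append(winner)
        index += 1


def simulate():
    requests = ["r1", "r2", "r3", "r4", "r5"]
    slots = SharedSlots(len(requests))
    logs = {"A": [], "B": [], "C": []}
    threads = [
        threading.Thread(target=run_node, args=(slots, requests, logs["A"])),
        threading.Thread(target=run_node, args=(slots, requests[::-1], logs["B"])),
        # Node C crashes after executing two requests.
        threading.Thread(target=run_node, args=(slots, requests[2:] + requests[:2], logs["C"], 2)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return logs


if __name__ == "__main__":
    for name, log in simulate().items():
        print(name, log)
```

In this sketch the fastest node wins each slot and slower nodes simply adopt the recorded winner, so no node blocks on a peer. The surviving nodes end with identical execution orders, and the crashed node's log is a prefix of that order. The specific order varies from run to run with thread scheduling.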