In disaster-resilient systems, wide-area networks play the dual roles of solution and problem. They allow systems to survive disasters because different parts of a system can be placed in geographically dispersed locations. They present a problem because communicating disaster-recovery information over wide-area links incurs high latency. The challenge is to continuously send disaster-recovery information to backup data centers without seriously degrading the online response time of the primary data center. We present a disaster-resilient atomic broadcast algorithm that meets this challenge.

One key to achieving disaster resilience at a reasonable cost is to define an atomic broadcast abstraction that is tailored to the multi-data-center setting. Unlike traditional atomic broadcast abstractions, our hierarchical atomic broadcast (HABcast) abstraction gives different delivery guarantees to processes in different data centers. The HABcast properties reflect the fact that only the processes in the primary data center are online (i.e., connected to clients). Roughly speaking, because processes in a backup data center do not interact with external entities, we can give them weaker delivery guarantees without compromising the overall reliability of the system.

Another key to practical disaster resilience is for algorithms to exploit the underlying fail-over mechanism between data centers. Fail-over to a backup data center is initiated by human operators, a so-called "push-button" switch-over. Because the fail-over decision is made by a human operator, the system itself does not have to guard against false disaster suspicions, and can thus be more efficient.

Our HABcast algorithm exploits the above aspects of disaster-resilient systems. Basically, the algorithm overlays a primary-backup scheme on top of a per-data-center atomic algorithm that broadcasts messages within a single data center. This combination presents some unique challenges, such as handling the simultaneous occurrence of failures and disasters, and preventing the plurality of processes within a single data center from resulting in a plurality of messages being communicated between data centers.