Dealing efficiently with data-center disasters

  • Authors:
  • Svend Frølund; Fernando Pedone

  • Affiliations:
  • Svend Frølund: Hewlett-Packard Laboratories, Software Technology Laboratory, Palo Alto, CA
  • Fernando Pedone: Hewlett-Packard Laboratories, Software Technology Laboratory, Palo Alto, CA, and École Polytechnique Fédérale de Lausanne, CH-1015, Switzerland

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2003

Abstract

In disaster-resilient systems, wide-area networks play the dual roles of solution and problem. They allow systems to survive disasters because different parts of a system can be located in geographically dispersed locations. They present a problem because sending disaster-recovery information over wide-area links incurs high latency. The challenge is to continuously send disaster-recovery information to backup data centers without seriously degrading the online response time of the primary data center. We present a disaster-resilient atomic broadcast algorithm that meets this challenge.

One key to achieving disaster resilience at a reasonable cost is to define an atomic broadcast abstraction that is tailored to the multi-data-center setting. Unlike traditional atomic broadcast abstractions, our hierarchical atomic broadcast (HABcast) abstraction gives different delivery guarantees to processes in different data centers. The HABcast properties reflect the fact that only the processes in the primary data center are online (i.e., connected to clients). Roughly speaking, because processes in a backup data center do not interact with external entities, we can give them weaker delivery guarantees without compromising the overall reliability of the system.

Another key to practical disaster resilience is for algorithms to exploit the underlying fail-over mechanism between data centers. Fail-over to a backup data center is initiated by human operators, a so-called "push-button" switch-over. Because the fail-over decision is made by a human operator, the system itself does not have to guard against false disaster suspicions and can thus be more efficient.

Our HABcast algorithm exploits both of these aspects of disaster-resilient systems. Essentially, the algorithm overlays a primary-backup scheme on top of a per-data-center algorithm that atomically broadcasts messages within a single data center. This combination presents some unique challenges, such as handling the simultaneous occurrence of failures and disasters, and ensuring that the many processes within a single data center do not translate into duplicate messages sent between data centers.
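
The HABcast idea above, giving weaker delivery guarantees to backup-site processes because they never face clients, can be made concrete with a small sketch. The abstract does not state the exact HABcast properties, so the Go program below illustrates only one plausible reading, in which a primary-site process must deliver the entire totally ordered message sequence while a backup-site process may hold any prefix of it; the names (role, deliveredOK) are illustrative inventions, not the paper's API.

    package main

    import "fmt"

    // Roles in a hierarchical atomic broadcast: only primary-site processes
    // serve clients, so only they need the full delivery guarantees.
    type role int

    const (
        primarySite role = iota
        backupSite
    )

    // deliveredOK checks, per role, whether a process's delivery sequence is
    // acceptable relative to the primary's total order: a primary-site
    // process must have delivered everything, while a backup-site process
    // may lag, holding only a prefix of that order. (Assumed semantics, not
    // the paper's stated properties.)
    func deliveredOK(r role, primaryOrder, delivered []string) bool {
        if r == primarySite && len(delivered) != len(primaryOrder) {
            return false
        }
        if len(delivered) > len(primaryOrder) {
            return false
        }
        for i, m := range delivered {
            if m != primaryOrder[i] {
                return false
            }
        }
        return true
    }

    func main() {
        order := []string{"m1", "m2", "m3"}
        fmt.Println(deliveredOK(primarySite, order, order))    // true
        fmt.Println(deliveredOK(backupSite, order, order[:2])) // true: a prefix suffices
        fmt.Println(deliveredOK(backupSite, order, []string{"m1", "m3"})) // false: a gap breaks the order
    }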
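
The final challenge the abstract names, preventing the many processes of the primary data center from generating duplicate inter-data-center traffic, suggests funneling the primary's ordered stream through one designated forwarder and making the backup idempotent. The sketch below is a hypothetical construction under those assumptions, not the paper's algorithm: sequence numbers let the backup absorb the retransmissions that a forwarder takeover can cause.

    package main

    import "fmt"

    // update pairs a payload with its position in the primary data center's
    // totally ordered delivery sequence.
    type update struct {
        seq     int
        payload string
    }

    // backupState applies updates idempotently and in order: any sequence
    // number other than the next expected one (e.g., a resend after a
    // forwarder crash and takeover) is dropped, so duplicate
    // inter-data-center messages cannot corrupt the backup.
    type backupState struct {
        nextSeq int
        log     []string
    }

    func (b *backupState) apply(u update) {
        if u.seq != b.nextSeq {
            return // duplicate or out-of-order resend; ignore it
        }
        b.log = append(b.log, u.payload)
        b.nextSeq = u.seq + 1
    }

    func main() {
        // Stand-in for the output of the primary's intra-data-center
        // atomic broadcast.
        primaryOrder := []update{
            {0, "credit A"},
            {1, "debit B"},
            {2, "credit C"},
        }

        bk := &backupState{}

        // Forwarder 1 ships the first two updates over the WAN, then crashes.
        for _, u := range primaryOrder[:2] {
            bk.apply(u)
        }

        // Forwarder 2 takes over; unsure of its predecessor's progress, it
        // conservatively resends from an earlier point in the order.
        for _, u := range primaryOrder[1:] {
            bk.apply(u)
        }

        // An operator-initiated ("push-button") fail-over would now promote
        // the backup, whose log holds each update exactly once, in order.
        fmt.Println(bk.log) // [credit A debit B credit C]
    }

Because duplicates are filtered at the backup, a new forwarder never needs to know exactly how far its crashed predecessor got; resending from any safe earlier point is harmless, which is one way to read the abstract's point about handling failures and disasters simultaneously.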