Dealing efficiently with data-center disasters

  • Authors:
  • Svend Frølund; Fernando Pedone

  • Affiliations:
  • Svend Frølund: Hewlett-Packard Laboratories, Software Technology Laboratory, Palo Alto, CA
  • Fernando Pedone: Hewlett-Packard Laboratories, Software Technology Laboratory, Palo Alto, CA, and École Polytechnique Fédérale de Lausanne, CH-1015, Switzerland

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2003

Abstract

In disaster-resilient systems, wide-area networks play the dual roles of solution and problem. They allow systems to survive disasters because different parts of a system can be located in geographically dispersed locations. They present a problem because sending disaster-recovery information over wide-area links incurs high latency. The challenge is to continuously send disaster-recovery information to backup data centers without seriously degrading the online response time of the primary data center. We present a disaster-resilient atomic broadcast algorithm that meets this challenge.

One key to achieving disaster resilience at a reasonable cost is to define an atomic broadcast abstraction that is tailored to the multi-data-center setting. Unlike traditional atomic broadcast abstractions, our hierarchical atomic broadcast (HABcast) abstraction gives different delivery guarantees to processes in different data centers. The HABcast properties reflect the fact that only the processes in the primary data center are online (i.e., connected to clients). Roughly speaking, because processes in a backup data center do not interact with external entities, we can give them weaker delivery guarantees without compromising the overall reliability of the system.

Another key to practical disaster resilience is for algorithms to exploit the underlying fail-over mechanism between data centers. Fail-over to a backup data center is initiated by human operators, a so-called "push-button" switch-over. Because the fail-over decision is made by a human operator, the system itself does not have to guard against false disaster suspicions and can thus be more efficient.

Our HABcast algorithm exploits both of these aspects of disaster-resilient systems. Essentially, the algorithm overlays a primary-backup scheme on top of a per-data-center algorithm that atomically broadcasts messages within a single data center. This combination presents some unique challenges, such as handling the simultaneous occurrence of failures and disasters, and ensuring that the many processes within a single data center do not translate into duplicate messages sent between data centers.
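
The HABcast idea above, giving weaker delivery guarantees to backup-site processes because they never face clients, can be made concrete with a small sketch. The abstract does not state the exact HABcast properties, so the Go program below illustrates only one plausible reading, in which a primary-site process must deliver the entire totally ordered message sequence while a backup-site process may hold any prefix of it; the names (role, deliveredOK) are illustrative inventions, not the paper's API.

    package main

    import "fmt"

    // Roles in a hierarchical atomic broadcast: only primary-site processes
    // serve clients, so only they need the full delivery guarantees.
    type role int

    const (
        primarySite role = iota
        backupSite
    )

    // deliveredOK checks, per role, whether a process's delivery sequence is
    // acceptable relative to the primary's total order: a primary-site
    // process must have delivered everything, while a backup-site process
    // may lag, holding only a prefix of that order. (Assumed semantics, not
    // the paper's stated properties.)
    func deliveredOK(r role, primaryOrder, delivered []string) bool {
        if r == primarySite && len(delivered) != len(primaryOrder) {
            return false
        }
        if len(delivered) > len(primaryOrder) {
            return false
        }
        for i, m := range delivered {
            if m != primaryOrder[i] {
                return false
            }
        }
        return true
    }

    func main() {
        order := []string{"m1", "m2", "m3"}
        fmt.Println(deliveredOK(primarySite, order, order))    // true
        fmt.Println(deliveredOK(backupSite, order, order[:2])) // true: a prefix suffices
        fmt.Println(deliveredOK(backupSite, order, []string{"m1", "m3"})) // false: a gap breaks the order
    }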
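
The final challenge the abstract names, preventing the many processes of the primary data center from generating duplicate inter-data-center traffic, suggests funneling the primary's ordered stream through one designated forwarder and making the backup idempotent. The sketch below is a hypothetical construction under those assumptions, not the paper's algorithm: sequence numbers let the backup absorb the retransmissions that a forwarder takeover can cause.

    package main

    import "fmt"

    // update pairs a payload with its position in the primary data center's
    // totally ordered delivery sequence.
    type update struct {
        seq     int
        payload string
    }

    // backupState applies updates idempotently and in order: any sequence
    // number other than the next expected one (e.g., a resend after a
    // forwarder crash and takeover) is dropped, so duplicate
    // inter-data-center messages cannot corrupt the backup.
    type backupState struct {
        nextSeq int
        log     []string
    }

    func (b *backupState) apply(u update) {
        if u.seq != b.nextSeq {
            return // duplicate or out-of-order resend; ignore it
        }
        b.log = append(b.log, u.payload)
        b.nextSeq = u.seq + 1
    }

    func main() {
        // Stand-in for the output of the primary's intra-data-center
        // atomic broadcast.
        primaryOrder := []update{
            {0, "credit A"},
            {1, "debit B"},
            {2, "credit C"},
        }

        bk := &backupState{}

        // Forwarder 1 ships the first two updates over the WAN, then crashes.
        for _, u := range primaryOrder[:2] {
            bk.apply(u)
        }

        // Forwarder 2 takes over; unsure of its predecessor's progress, it
        // conservatively resends from an earlier point in the order.
        for _, u := range primaryOrder[1:] {
            bk.apply(u)
        }

        // An operator-initiated ("push-button") fail-over would now promote
        // the backup, whose log holds each update exactly once, in order.
        fmt.Println(bk.log) // [credit A debit B credit C]
    }

Because duplicates are filtered at the backup, a new forwarder never needs to know exactly how far its crashed predecessor got; resending from any safe earlier point is harmless, which is one way to read the abstract's point about handling failures and disasters simultaneously.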