Aspen trees: balancing data center fault tolerance, scalability and cost

  • Authors:
  • Meg Walraed-Sullivan; Amin Vahdat; Keith Marzullo

  • Affiliations:
  • Microsoft Research, Redmond, WA, USA; Google, UC San Diego, Mountain View, CA, USA; UC San Diego, La Jolla, CA, USA

  • Venue:
  • Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
  • Year:
  • 2013

Abstract

Fault recovery is a key issue in modern data centers. In a fat tree topology, a single link failure can disconnect a set of end hosts from the rest of the network until updated routing information is disseminated to every switch in the topology. The time for re-convergence can be substantial, leaving hosts disconnected for long periods and significantly reducing the overall availability of the data center. Moreover, the message overhead of sending updated routing information to the entire topology may be unacceptable at scale. We present techniques to modify hierarchical data center topologies to enable switches to react to failures locally, thus reducing both the convergence time and the control overhead of failure recovery. We find that, for a given network size, decreasing a topology's convergence time results in a proportional decrease in its scalability (e.g., the number of hosts supported). On the other hand, reducing convergence time without affecting scalability necessitates the introduction of additional switches and links. We explore the tradeoffs between fault tolerance, scalability and network size, and propose a range of modified multi-rooted tree topologies that provide significantly reduced convergence time while retaining most of the traditional fat tree's desirable properties.
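
The scalability side of this tradeoff can be made concrete with a rough sketch. The host and switch counts below are the standard figures for a fat tree built from k-port switches. The `redundancy` parameter and the function `hosts_with_redundancy` are hypothetical names introduced here for illustration: they assume that host count shrinks in direct proportion to the amount of added redundancy, which is one reading of the "proportional decrease" described in the abstract and not necessarily the authors' exact model.

```python
def fat_tree_hosts(k: int, levels: int = 3) -> int:
    """Hosts supported by a standard fat tree of k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even port count >= 2")
    if levels < 1:
        raise ValueError("levels must be >= 1")
    # Each edge switch uses k/2 ports for hosts; 3 levels gives k^3 / 4.
    return 2 * (k // 2) ** levels


def fat_tree_switches(k: int, levels: int = 3) -> int:
    """Switches needed by the same fat tree (5k^2 / 4 for 3 levels)."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even port count >= 2")
    if levels < 1:
        raise ValueError("levels must be >= 1")
    return (2 * levels - 1) * (k // 2) ** (levels - 1)


def hosts_with_redundancy(k: int, levels: int, redundancy: int) -> int:
    """Hosts left when a fixed-size network trades capacity for redundancy.

    ASSUMPTION (not taken from the source): host count is divided by an
    integer redundancy factor, modelling a proportional decrease.
    """
    hosts = fat_tree_hosts(k, levels)
    if redundancy < 1 or hosts % redundancy != 0:
        raise ValueError("redundancy must be a positive divisor of the host count")
    return hosts // redundancy


if __name__ == "__main__":
    k = 48
    print("switches:", fat_tree_switches(k))
    for r in (1, 2, 4):
        print("redundancy", r, "hosts", hosts_with_redundancy(k, 3, r))
```

Under these assumptions, keeping the switch count fixed while doubling redundancy halves the number of hosts supported. Keeping the host count fixed instead would require a larger network, which is the other side of the tradeoff the abstract describes.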