Fault recovery is a key issue in modern data centers. In a fat tree topology, a single link failure can disconnect a set of end hosts from the rest of the network until updated routing information is disseminated to every switch in the topology. The time for re-convergence can be substantial, leaving hosts disconnected for long periods and significantly reducing the overall availability of the data center. Moreover, the message overhead of sending updated routing information to the entire topology may be unacceptable at scale. We present techniques to modify hierarchical data center topologies to enable switches to react to failures locally, thus reducing both the convergence time and the control overhead of failure recovery. We find that for a given network size, decreasing a topology's convergence time results in a proportional decrease in its scalability (e.g., the number of hosts supported). On the other hand, reducing convergence time without affecting scalability necessitates the introduction of additional switches and links. We explore the tradeoffs between fault tolerance, scalability, and network size, and propose a range of modified multi-rooted tree topologies that provide significantly reduced convergence time while retaining most of the traditional fat tree's desirable properties.