Fat-trees: universal networks for hardware-efficient supercomputing
IEEE Transactions on Computers
Introduction to parallel algorithms and architectures: array, trees, hypercubes
Introduction to parallel algorithms and architectures: array, trees, hypercubes
Fast Algorithms for Routing Around Faults in Multibutterflies and Randomly-Wired Splitter Networks
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Tolerating Multiple Faults in Multistage Interconnection Networks with Minimal Extra Stages
IEEE Transactions on Computers
An integrated experimental environment for distributed systems and networks
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Achieving convergence-free routing using failure-carrying packets
Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
The Extra Stage Cube: A Fault-Tolerant Interconnection Network for Supersystems
IEEE Transactions on Computers
A scalable, commodity data center network architecture
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
Dcell: a scalable and fault-tolerant network structure for data centers
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
PortLand: a scalable fault-tolerant layer 2 data center network fabric
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
VL2: a scalable and flexible data center network
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
BCube: a high performance, server-centric network architecture for modular data centers
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Hedera: dynamic flow scheduling for data center networks
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Network traffic characteristics of data centers in the wild
IMC '10 Proceedings of the 10th ACM SIGCOMM conference on Internet measurement
Managing data transfers in computer clusters with orchestra
Proceedings of the ACM SIGCOMM 2011 conference
Improving datacenter performance and robustness with multipath TCP
Proceedings of the ACM SIGCOMM 2011 conference
Understanding network failures in data centers: measurement, analysis, and implications
Proceedings of the ACM SIGCOMM 2011 conference
Data-driven network connectivity
Proceedings of the 10th ACM Workshop on Hot Topics in Networks
MicroTE: fine grained traffic engineering for data centers
Proceedings of the Seventh COnference on emerging Networking EXperiments and Technologies
Jellyfish: networking data centers randomly
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Demystifying the dark side of the middle: a field study of middlebox failures in datacenters
Proceedings of the 2013 conference on Internet measurement conference
An untold story of redundant clouds: making your service deployment truly reliable
Proceedings of the 9th Workshop on Hot Topics in Dependable Systems
Scaling IP multicast on datacenter topologies
Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
Aspen trees: balancing data center fault tolerance, scalability and cost
Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
Scalable, optimal flow routing in datacenters via local link balancing
Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
Dahu: commodity switches for direct connect data center networks
ANCS '13 Proceedings of the ninth ACM/IEEE symposium on Architectures for networking and communications systems
Hi-index | 0.00 |
The data center network is increasingly a cost, reliability and performance bottleneck for cloud computing. Although multi-tree topologies can provide scalable bandwidth and traditional routing algorithms can provide eventual fault tolerance, we argue that recovery speed can be dramatically improved through the co-design of the network topology, routing algorithm and failure detector. We create an engineered network and routing protocol that directly address the failure characteristics observed in data centers. At the core of our proposal is a novel network topology that has many of the same desirable properties as FatTrees, but with much better fault recovery properties. We then create a series of failover protocols that benefit from this topology and are designed to cascade and complement each other. The resulting system, F10, can almost instantaneously reestablish connectivity and load balance, even in the presence of multiple failures. Our results show that following network link and switch failures, F10 has less than 1/7th the packet loss of current schemes. A trace-driven evaluation of MapReduce performance shows that F10's lower packet loss yields a median application-level 30% speedup.