F10: a fault-tolerant engineered network

Authors:
Vincent Liu;Daniel Halperin;Arvind Krishnamurthy;Thomas Anderson
Affiliations:
University of Washington;University of Washington;University of Washington;University of Washington
Venue:
NSDI '13: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation
Year:
2013

Abstract

The data center network is increasingly a cost, reliability, and performance bottleneck for cloud computing. Although multi-tree topologies can provide scalable bandwidth and traditional routing algorithms can provide eventual fault tolerance, we argue that recovery speed can be dramatically improved through the co-design of the network topology, routing algorithm, and failure detector. We create an engineered network and routing protocol that directly address the failure characteristics observed in data centers. At the core of our proposal is a novel network topology that has many of the same desirable properties as FatTrees, but with much better fault recovery properties. We then create a series of failover protocols that benefit from this topology and are designed to cascade and complement each other. The resulting system, F10, can almost instantaneously reestablish connectivity and load balance, even in the presence of multiple failures. Our results show that following network link and switch failures, F10 has less than 1/7th the packet loss of current schemes. A trace-driven evaluation of MapReduce performance shows that F10's lower packet loss yields a median application-level speedup of 30%.