Surviving failures in bandwidth-constrained datacenters

Authors:
Peter Bodík;Ishai Menache;Mosharaf Chowdhury;Pradeepkumar Mani;David A. Maltz;Ion Stoica
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;UC Berkeley, Berkeley, CA, USA;Microsoft, Redmond, WA, USA;Microsoft, Redmond, WA, USA;UC Berkeley, Berkeley, CA, USA
Venue:
Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Year:
2012

Abstract

Datacenter networks have been designed to tolerate failures of network equipment and to provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there is an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core: spreading servers across fault domains improves fault tolerance but requires additional bandwidth, while deploying servers together reduces bandwidth usage but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that analysis, we propose and evaluate a novel optimization framework that achieves high fault tolerance while significantly reducing bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
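
The tradeoff described in the abstract can be illustrated with a toy model. The sketch below is not the paper's optimization framework; the server names, rack names, traffic rates, and the two metrics (`worst_case_survival`, `core_traffic`) are all assumptions made up for illustration. It treats each rack as one fault domain and counts any traffic between different racks as core traffic.

```python
from collections import Counter

def worst_case_survival(placement):
    """Fraction of servers still up after the most heavily used fault domain fails."""
    counts = Counter(placement.values())
    total = len(placement)
    return (total - max(counts.values())) / total

def core_traffic(placement, traffic):
    """Sum of traffic rates between servers placed in different fault domains."""
    return sum(rate for (a, b), rate in traffic.items()
               if placement[a] != placement[b])

# Hypothetical, skewed communication pattern: two heavy pairs, two light pairs.
traffic = {("s0", "s1"): 10, ("s2", "s3"): 10, ("s0", "s2"): 1, ("s1", "s3"): 1}

# Three hypothetical placements of the same four servers.
packed = {"s0": "rack0", "s1": "rack0", "s2": "rack0", "s3": "rack0"}
spread = {"s0": "rack0", "s1": "rack1", "s2": "rack2", "s3": "rack3"}
skew_aware = {"s0": "rack0", "s1": "rack0", "s2": "rack1", "s3": "rack1"}

for name, placement in [("packed", packed), ("spread", spread),
                        ("skew_aware", skew_aware)]:
    print(name, worst_case_survival(placement), core_traffic(placement, traffic))
```

In this toy setting, packing everything into one rack uses no core bandwidth but loses every server on a single rack failure, while full spreading survives best but sends all traffic through the core. Keeping only the heavily communicating pairs together sits between the two, which is the kind of benefit that skewed communication patterns make possible.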