Understanding network failures in data centers: measurement, analysis, and implications

Authors:
Phillipa Gill;Navendu Jain;Nachiappan Nagappan
Affiliations:
University of Toronto, Toronto, Canada;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the ACM SIGCOMM 2011 conference
Year:
2011

Abstract

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.