When the network crumbles: an empirical study of cloud network failures and their impact on services

Authors:
Rahul Potharaju; Navendu Jain
Affiliations:
Purdue University; Microsoft Research
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Abstract

The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and across the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. This paper makes one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. We describe a large-scale study that analyzes and correlates failure events over three years across multiple datacenters and thousands of network elements such as Access routers, Aggregation switches, Top-of-Rack switches, and long-haul links. Our study reveals several important findings on (a) the availability of network domains, (b) root causes of failures, (c) service impact, (d) the effectiveness of repairs, and (e) failure modeling. Finally, we outline steps based on existing network mechanisms to improve service availability.