An empirical analysis of intra- and inter-datacenter network failures for geo-distributed services
Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
The growing demand for always-on, low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and on the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. This paper makes one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. We describe a large-scale study that analyzes and correlates failure events over three years across multiple datacenters and thousands of network elements, such as Access routers, Aggregation switches, Top-of-Rack switches, and long-haul links. Our study reveals several important findings on (a) the availability of network domains, (b) root causes, (c) service impact, (d) effectiveness of repairs, and (e) modeling failures. Finally, we outline steps based on existing network mechanisms to improve service availability.