We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices and links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as top-of-rack switches (ToRs) and aggregation switches (AggS) are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.