A case study of OSPF behavior in a large enterprise network
Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Principles and Practices of Interconnection Networks
Principles and Practices of Interconnection Networks
Shrink: a tool for failure diagnosis in IP networks
Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data
WAP5: black-box performance debugging for wide-area systems
Proceedings of the 15th international conference on World Wide Web
Autopilot: automatic data center management
ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research
Structure management for scalable overlay service construction
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Path-based faliure and evolution management
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
IP fault localization via risk modeling
NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
Towards highly reliable enterprise network services via inference of multi-level dependencies
Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
A scalable, commodity data center network architecture
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
Answering what-if deployment and configuration questions with wise
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
PortLand: a scalable fault-tolerant layer 2 data center network fabric
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
VL2: a scalable and flexible data center network
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Detailed diagnosis in enterprise networks
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
A Framework for Distributed Monitoring and Root Cause Analysis for Large IP Networks
SRDS '09 Proceedings of the 2009 28th IEEE International Symposium on Reliable Distributed Systems
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
The nature of data center traffic: measurements & analysis
Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference
R3: resilient routing reconfiguration
Proceedings of the ACM SIGCOMM 2010 conference
Challenges in cloud scale data centers
Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
zUpdate: updating data center networks with zero loss
Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM
An untold story of redundant clouds: making your service deployment truly reliable
Proceedings of the 9th Workshop on Hot Topics in Dependable Systems
Per-packet load-balanced, low-latency routing for clos-based data center networks
Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
Hi-index | 0.00 |
Driven by the soaring demands for always-on and fast-response online services, modern datacenter networks have recently undergone tremendous growth. These networks often rely on commodity hardware to reach immense scale while keeping capital expenses under check. The downside is that commodity devices are prone to failures, raising a formidable challenge for network operators to promptly handle these failures with minimal disruptions to the hosted services. Recent research efforts have focused on automatic failure localization. Yet, resolving failures still requires significant human interventions, resulting in prolonged failure recovery time. Unlike previous work, NetPilot aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do -- by deactivating or restarting suspected offending components. NetPilot circumvents the need for knowing the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot is comprised of an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.