NetPilot: automating datacenter network failure mitigation

Authors:
Xin Wu;Daniel Turner;Chao-Chih Chen;David A. Maltz;Xiaowei Yang;Lihua Yuan;Ming Zhang
Affiliations:
Duke University, Durham, NC, USA;University of California, San Diego, San Diego, CA, USA;University of California, Davis, Davis, USA;Microsoft, Bellevue, WA, USA;Duke University, Durham, USA;Microsoft, Bellevue, WA, USA;Microsoft, Redmond, WA, USA
Venue:
Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Year:
2012

Abstract

Driven by soaring demand for always-on, fast-response online services, modern datacenter networks have recently undergone tremendous growth. These networks often rely on commodity hardware to reach immense scale while keeping capital expenses in check. The downside is that commodity devices are prone to failure, which poses a formidable challenge for network operators: they must handle these failures promptly and with minimal disruption to the hosted services. Recent research efforts have focused on automatic failure localization. Yet resolving failures still requires significant human intervention, resulting in prolonged failure recovery time. Unlike previous work, NetPilot aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do: by deactivating or restarting suspected offending components. NetPilot circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot comprises an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.
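
The trial-and-error idea in the abstract can be illustrated with a minimal sketch. This is not NetPilot's implementation: the abstract does not specify its algorithms, so every name here (`Action`, `mitigate`, `estimate_impact`, `max_impact`, the component labels, and the impact numbers) is a hypothetical stand-in. The sketch assumes a planner has already ordered the candidate actions, that an impact estimator returns a single disruption score per action, and that an unsuccessful action is rolled back before the next trial.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Action:
    """A candidate mitigation, e.g. deactivating or restarting one component."""
    kind: str        # "deactivate" or "restart" (illustrative labels)
    component: str   # hypothetical component identifier


def mitigate(
    candidates: Iterable[Action],
    estimate_impact: Callable[[Action], float],
    apply: Callable[[Action], None],
    rollback: Callable[[Action], None],
    failure_persists: Callable[[], bool],
    max_impact: float,
) -> Optional[Action]:
    """Try candidates in order; return the first action that mitigates the failure."""
    for action in candidates:
        # Guard: skip actions whose estimated disruption exceeds the threshold.
        if estimate_impact(action) > max_impact:
            continue
        apply(action)
        if not failure_persists():
            return action
        # The trial did not help, so undo it before trying the next candidate.
        rollback(action)
    return None


# Toy scenario (all values invented for illustration).
faulty = "link-2"
active = set()  # actions currently in effect
impact = {"link-1": 0.1, "link-2": 0.2, "core-switch": 0.9}
plan = [Action("deactivate", c) for c in ("core-switch", "link-1", "link-2")]

winner = mitigate(
    plan,
    estimate_impact=lambda a: impact[a.component],
    apply=active.add,
    rollback=active.discard,
    # The failure clears only once the faulty component has been deactivated.
    failure_persists=lambda: not any(a.component == faulty for a in active),
    max_impact=0.5,
)
print(winner)
```

In this toy run the `core-switch` action is skipped as too disruptive, the `link-1` trial is applied and rolled back, and the `link-2` trial succeeds. A planner that orders candidates well would reduce the number of such trials, which is the role the abstract assigns to the failure-specific mitigation planner.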