Detailed diagnosis in enterprise networks

Authors:
Srikanth Kandula;Ratul Mahajan;Patrick Verkaik;Sharad Agarwal;Jitendra Padhye;Paramvir Bahl
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;University of California, San Diego, San Diego, CA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Year:
2009

Abstract

By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application-specific faults (e.g., error codes). It should also identify culprits at a fine granularity, such as a process or firewall configuration. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. It formulates detailed diagnosis as an inference problem that more faithfully captures the behaviors and interactions of fine-grained network components such as processes. The primary challenge in solving this problem is inferring when a component might be impacting another. Our solution is based on an intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood that they are impacting one another in the present. We find that our deployed prototype is effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as the most likely culprit in 80% of the cases and is almost always in the list of top five culprits.
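
The history-based idea in the abstract can be illustrated with a minimal sketch. This is one plausible reading of that single sentence, not NetMedic's actual algorithm: the function names, the scalar state representation, the linear similarity measure, and the parameters `scale` and `k` are all assumptions made for illustration. The sketch looks up the past moments when the source component's state most resembled its current state, then checks whether the destination component's state at those moments resembled its current state.

```python
from typing import Sequence


def similarity(a: float, b: float, scale: float) -> float:
    """Hypothetical similarity: 1.0 for equal states, 0.0 once they differ by `scale` or more."""
    return max(0.0, 1.0 - abs(a - b) / scale)


def impact_likelihood(
    src_history: Sequence[float],
    dst_history: Sequence[float],
    src_now: float,
    dst_now: float,
    scale: float = 1.0,
    k: int = 3,
) -> float:
    """Estimate how likely the source is impacting the destination right now.

    Assumption: each component's state is a single number per time step,
    and the two histories are aligned in time.
    """
    if len(src_history) != len(dst_history):
        raise ValueError("histories must cover the same time steps")
    if not src_history:
        raise ValueError("at least one historical time step is required")

    # Rank past time steps by how closely the source's state matched its current state.
    ranked = sorted(range(len(src_history)), key=lambda t: abs(src_history[t] - src_now))
    nearest = ranked[:k]

    # If the destination looked like it does now at those moments, the joint
    # behavior is consistent with the source driving the destination.
    scores = [similarity(dst_history[t], dst_now, scale) for t in nearest]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    src = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9]
    coupled_dst = [0.2, 0.8, 0.2, 0.8, 0.2, 0.8]    # moves together with src
    flat_dst = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2]       # ignores src
    print(impact_likelihood(src, coupled_dst, src_now=0.9, dst_now=0.8))
    print(impact_likelihood(src, flat_dst, src_now=0.9, dst_now=0.8))
```

In this toy setup the coupled pair scores higher than the uncoupled pair, because the destination's abnormal current state has historically coincided with the source's current state only in the coupled case. A real system would work with many state variables per component and would need to combine such pairwise estimates across a dependency graph to rank culprits.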