Detailed diagnosis in enterprise networks

  • Authors:
  • Srikanth Kandula; Ratul Mahajan; Patrick Verkaik; Sharad Agarwal; Jitendra Padhye; Paramvir Bahl

  • Affiliations:
  • Microsoft Research, Redmond, WA, USA; Microsoft Research, Redmond, WA, USA; University of California, San Diego, San Diego, CA, USA; Microsoft Research, Redmond, WA, USA; Microsoft Research, Redmond, WA, USA; Microsoft Research, Redmond, WA, USA

  • Venue:
  • Proceedings of the ACM SIGCOMM 2009 conference on Data communication
  • Year:
  • 2009

Abstract

By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application-specific faults (e.g., error codes). It should also identify culprits at a fine granularity, such as a process or firewall configuration. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. It formulates detailed diagnosis as an inference problem that more faithfully captures the behaviors and interactions of fine-grained network components such as processes. The primary challenge in solving this problem is inferring when a component might be impacting another. Our solution is based on an intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood of them impacting one another in the present. We find that our deployed prototype is effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as the most likely culprit in 80% of the cases and is almost always in the list of top five culprits.
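
The idea of using past joint behavior to estimate present impact can be illustrated with a small sketch. This is not NetMedic's actual algorithm, which the abstract does not specify; it is a minimal illustration under stated assumptions. The function names (`similarity`, `impact_likelihood`), the representation of component state as a list of values normalized to [0, 1], the nearest-neighbor parameter `k`, and the fallback value for an empty history are all hypothetical. The sketch looks up the past moments when the source component's state most resembled its current state, then checks whether the destination component's state at those moments resembles its current state. If it does, the pair's past joint behavior is consistent with the source impacting the destination now.

```python
def similarity(a, b):
    """Similarity in [0, 1] between two equal-length state vectors.

    Assumes every value is already normalized to [0, 1] (an assumption
    of this sketch, not something stated in the abstract).
    """
    if not a:
        return 1.0
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def impact_likelihood(history, src_now, dst_now, k=3):
    """Estimate how likely the source component is impacting the destination.

    history: list of (src_state, dst_state) pairs observed at the same
             past instants (hypothetical data layout).
    src_now, dst_now: current state vectors of the two components.
    k: number of most similar past instants to consult (hypothetical knob).
    """
    if not history:
        return 0.5  # no evidence either way; arbitrary neutral default

    # Rank past instants by how closely the source resembled its current state.
    ranked = sorted(history, key=lambda pair: similarity(pair[0], src_now),
                    reverse=True)
    nearest = ranked[:k]

    # If the destination looked like it does now at those instants, the joint
    # history is consistent with the source driving the destination.
    return sum(similarity(dst_past, dst_now) for _, dst_past in nearest) / len(nearest)


# Hypothetical history: source load vs. destination latency, both in [0, 1].
history = [([0.1], [0.1]), ([0.2], [0.1]), ([0.9], [0.9]),
           ([0.8], [0.9]), ([0.1], [0.2])]

consistent = impact_likelihood(history, [0.9], [0.9], k=2)
inconsistent = impact_likelihood(history, [0.9], [0.1], k=2)
print(consistent, inconsistent)
```

In this toy history, a busy source has always coincided with a slow destination, so the first query scores high, while the second (busy source, fast destination) scores low because the destination's current state does not match what past joint behavior would predict.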