IP fault localization via risk modeling

  • Authors:
  • Ramana Rao Kompella; Jennifer Yates; Albert Greenberg; Alex C. Snoeren

  • Affiliations:
  • University of California, San Diego; AT&T Labs-Research; AT&T Labs-Research; University of California, San Diego

  • Venue:
  • NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
  • Year:
  • 2005

Abstract

Automated, rapid, and effective fault management is a central goal of large operational IP networks. Today's networks suffer from a wide and volatile set of failure modes, where the underlying fault proves difficult to detect and localize, thereby delaying repair. One of the main challenges stems from operational reality: IP routing and the underlying optical fiber plant are typically described by disparate data models and housed in distinct network management systems. We introduce a fault-localization methodology based on the use of risk models and an associated troubleshooting system, SCORE (Spatial Correlation Engine), which automatically identifies likely root causes across layers. In particular, we apply SCORE to the problem of localizing link failures in IP and optical networks. In experiments conducted on a tier-1 ISP backbone, SCORE proved remarkably effective at localizing optical link failures using only IP-layer event logs. Moreover, SCORE was often able to automatically uncover inconsistencies in the databases that maintain the critical associations between the IP and optical networks.
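
To make the idea of a risk model concrete, the following is a minimal sketch, not the SCORE implementation. It assumes a risk model can be represented as a mapping from each lower-layer component (for example a fiber span) to the set of IP links that depend on it, and it uses a simple greedy cover to pick the fewest components that explain a set of observed IP link failures. The names `localize`, `risk_groups`, and `min_hit_ratio`, the hit-ratio filter, and the sample topology are all illustrative assumptions rather than details taken from the abstract.

```python
from typing import Dict, List, Set, Tuple


def localize(
    risk_groups: Dict[str, Set[str]],
    failed_links: Set[str],
    min_hit_ratio: float = 1.0,
) -> Tuple[List[str], Set[str]]:
    """Greedy sketch: explain failed IP links with as few risk groups as possible.

    risk_groups maps a hypothetical component name to the IP links that share it.
    A group is only considered if at least min_hit_ratio of its links failed.
    Returns the chosen groups and any failed links left unexplained.
    """
    # Keep only groups whose observed failures match the group closely enough.
    eligible = {
        name: links
        for name, links in risk_groups.items()
        if links and len(links & failed_links) / len(links) >= min_hit_ratio
    }

    unexplained = set(failed_links)
    hypothesis: List[str] = []
    while unexplained and eligible:
        # Pick the group covering the most unexplained failures
        # (ties go to the alphabetically first name).
        best = max(sorted(eligible), key=lambda n: len(eligible[n] & unexplained))
        covered = eligible[best] & unexplained
        if not covered:
            break
        hypothesis.append(best)
        unexplained -= covered
        del eligible[best]
    return hypothesis, unexplained


# Hypothetical cross-layer associations: component -> dependent IP links.
groups = {
    "fiber_A": {"ip1", "ip2", "ip3"},
    "fiber_B": {"ip3", "ip4"},
    "router_R": {"ip4", "ip5"},
}
causes, leftover = localize(groups, {"ip1", "ip2", "ip3"})
print(causes, leftover)
```

In this sketch, a failed link that no eligible group explains is returned in `leftover`. Under the stated assumptions, such leftovers are one place where an error in the recorded IP-to-optical associations could surface, which is in the spirit of the database inconsistencies the abstract mentions.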