Troubleshooting chronic conditions in large IP networks

  • Authors:
  • Ajay Mahimkar;Jennifer Yates;Yin Zhang;Aman Shaikh;Jia Wang;Zihui Ge;Cheng Tien Ee

  • Affiliations:
  • The University of Texas at Austin;AT&T Labs -- Research;The University of Texas at Austin;AT&T Labs -- Research;AT&T Labs -- Research;AT&T Labs -- Research;AT&T Labs -- Research

  • Venue:
  • CoNEXT '08 Proceedings of the 2008 ACM CoNEXT Conference
  • Year:
  • 2008

Abstract

Chronic network conditions are caused by performance-impairing events that occur intermittently over an extended period of time. Such conditions can cause repeated performance degradation for customers, and can sometimes even turn into serious hard failures. It is therefore critical to troubleshoot and repair chronic network conditions in a timely fashion in order to ensure high reliability and performance in large IP networks. Today, troubleshooting chronic conditions is often performed manually, making it a tedious, time-consuming, and error-prone process. In this paper, we present NICE (Network-wide Information Correlation and Exploration), a novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources. NICE uses a novel circular permutation test to determine the statistical significance of a correlation. It also allows flexible analysis at various spatial granularities (e.g., link, router, or network level). We validate NICE using real measurement data collected at a tier-1 ISP network. The results are quite positive. We then apply NICE to troubleshoot real network issues in the tier-1 ISP network. In all three case studies conducted so far, NICE successfully uncovers previously unknown chronic network conditions, resulting in improved network operations.
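
The idea of a circular permutation test for correlation significance can be illustrated with a minimal sketch. It assumes two equal-length, time-binned event series (1 if an event occurred in the bin, 0 otherwise), uses the Pearson coefficient as the correlation measure, and scores the observed correlation against the correlations obtained at every non-zero circular shift of one series. The function names, the choice of Pearson correlation, and the z-like score are illustrative assumptions; the abstract does not specify these details, so this is not necessarily the procedure NICE implements.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def circular_permutation_score(x, y):
    """Return (observed correlation, score against circular shifts).

    Hypothetical helper: the score is the observed correlation expressed
    in standard deviations above the mean correlation over all non-zero
    circular shifts of y. Circular shifting keeps each series' own
    temporal structure (e.g. burstiness) while breaking its alignment
    with the other series.
    """
    x, y = list(x), list(y)
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("series must have equal length of at least 3")
    observed = pearson(x, y)
    # Correlation at every non-zero circular shift of y.
    shifted = [pearson(x, y[k:] + y[:k]) for k in range(1, n)]
    mean = sum(shifted) / len(shifted)
    var = sum((c - mean) ** 2 for c in shifted) / len(shifted)
    std = math.sqrt(var)
    score = (observed - mean) / std if std > 0 else 0.0
    return observed, score
```

A caller would bin two event sources (for example, two kinds of alarms on the same router) over the same time window and treat the pair as significantly correlated when the score exceeds a chosen threshold. The threshold is a tuning decision that this sketch leaves open.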