Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis

  • Authors:
  • Mohammad A. Munawar;Thomas Reidemeister;Miao Jiang;Allen George;Paul A. Ward

  • Affiliations:
  • Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1

  • Venue:
  • DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
  • Year:
  • 2008

Abstract

Ensuring high availability, adequate performance, and proper operation of enterprise software systems requires continuous monitoring. Today, most systems operate with minimal monitoring, typically based on service-level objectives (SLOs). Detailed metric-based monitoring is often too costly to use in production, while tracing is prohibitively expensive. Configuring monitoring when problems occur is a manual process.

In this paper we propose an alternative: Minimal monitoring with SLOs is used to detect errors. When an error is detected, detailed monitoring is automatically enabled to validate errors using invariant-correlation models. If validated, Application-Response-Measurement (ARM) tracing is dynamically activated on the faulty subsystem and a healthy peer to perform differential trace-data analysis and diagnosis.

Based on fault-injection experiments, we show that our system is effective; it correctly detected and validated errors caused by 14 out of 15 injected faults. Differential analysis of the trace data collected for 210 seconds allowed us to top-rank the faulty component in 80% of the cases. In the remaining cases the faulty component was ranked within the top-7 out of 81 components. We also demonstrate that the overhead of our system is low; given a false positive rate of one per hour, the overhead is less than 2.5%.
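
The staged escalation described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Level`, `slo_violated`, `invariant_violated`, `next_level`, and `rank_components` are hypothetical, the invariant is assumed to be a simple linear relation between two metrics, and the differential ranking is assumed to compare per-component mean latencies between the suspect subsystem and its healthy peer.

```python
from enum import Enum


class Level(Enum):
    MINIMAL = 1   # SLO checks only
    DETAILED = 2  # metric collection used to validate a suspected error
    TRACING = 3   # tracing on the suspect subsystem and a healthy peer


def slo_violated(response_ms, slo_ms):
    """Hypothetical SLO check: response time exceeds the agreed bound."""
    return response_ms > slo_ms


def invariant_violated(x, y, slope, intercept, tolerance):
    """Assumed linear invariant: y should stay near slope * x + intercept."""
    return abs(y - (slope * x + intercept)) > tolerance


def next_level(level, slo_bad, invariant_bad):
    """Escalate one stage at a time; drop back to MINIMAL on a false alarm."""
    if level is Level.MINIMAL:
        return Level.DETAILED if slo_bad else Level.MINIMAL
    if level is Level.DETAILED:
        return Level.TRACING if invariant_bad else Level.MINIMAL
    return Level.TRACING  # remain here until diagnosis is finished


def rank_components(faulty_ms, healthy_ms):
    """Rank components by relative latency increase over the healthy peer.

    Both arguments map a component name to its mean latency in milliseconds.
    Components missing from the faulty trace are treated as unchanged.
    """
    scores = {}
    for name, healthy in healthy_ms.items():
        faulty = faulty_ms.get(name, healthy)
        scores[name] = (faulty - healthy) / healthy if healthy > 0 else 0.0
    return sorted(scores, key=scores.get, reverse=True)
```

In this sketch a false alarm costs only the detailed-monitoring stage, because `next_level` returns to `Level.MINIMAL` when the invariant check does not confirm the SLO violation. This mirrors the low-overhead argument in the abstract, though the cost figures reported there come from the paper's experiments and not from this code.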