Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis

Authors:
Mohammad A. Munawar;Thomas Reidemeister;Miao Jiang;Allen George;Paul A. Ward
Affiliations:
Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1;Shoshin Distributed Systems Group, University of Waterloo, Waterloo, N2L 3G1
Venue:
DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
Year:
2008


Abstract

Ensuring high availability, adequate performance, and proper operation of enterprise software systems requires continuous monitoring. Today, most systems operate with minimal monitoring, typically based on service-level objectives (SLOs). Detailed metric-based monitoring is often too costly to use in production, while tracing is prohibitively expensive. Configuring monitoring when problems occur is a manual process.

In this paper we propose an alternative: minimal monitoring with SLOs is used to detect errors. When an error is detected, detailed monitoring is automatically enabled to validate errors using invariant-correlation models. If validated, Application-Response-Measurement (ARM) tracing is dynamically activated on the faulty subsystem and a healthy peer to perform differential trace-data analysis and diagnosis.

Based on fault-injection experiments, we show that our system is effective; it correctly detected and validated errors caused by 14 out of 15 injected faults. Differential analysis of the trace data collected for 210 seconds allowed us to top-rank the faulty component in 80% of the cases. In the remaining cases the faulty component was ranked within the top-7 out of 81 components. We also demonstrate that the overhead of our system is low; given a false positive rate of one per hour, the overhead is less than 2.5%.
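
The staged escalation described in the abstract (SLO check, then invariant validation, then differential tracing against a healthy peer) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the input layout (component name mapped to a list of response times), and the scoring rule (relative difference in mean response time) are all assumptions made for illustration, and the abstract does not state which statistic the paper uses to rank components.

```python
from statistics import mean


def rank_components(faulty_trace, healthy_trace):
    """Rank components by how much their mean response time on the
    faulty subsystem deviates from the healthy peer.

    Both arguments map a component name to a list of response times.
    The relative-difference score is an illustrative assumption.
    """
    scores = {}
    for component, faulty_times in faulty_trace.items():
        healthy_times = healthy_trace.get(component)
        if not faulty_times or not healthy_times:
            continue  # no basis for comparison without data from both sides
        baseline = mean(healthy_times)
        observed = mean(faulty_times)
        if baseline == 0:
            scores[component] = 0.0 if observed == 0 else float("inf")
        else:
            scores[component] = abs(observed - baseline) / baseline
    # Most suspicious component first.
    return sorted(scores, key=scores.get, reverse=True)


def diagnose(slo_violated, invariants_hold, collect_traces):
    """Escalate monitoring only as far as the evidence requires.

    slo_violated:    result of the cheap, always-on SLO check
    invariants_hold: result of the detailed metric check, consulted
                     only after an SLO violation
    collect_traces:  hypothetical callable that enables tracing and
                     returns (faulty_trace, healthy_trace)
    """
    if not slo_violated:
        return None  # stay at minimal monitoring
    if invariants_hold:
        return None  # error not validated; treat as a false alarm
    faulty_trace, healthy_trace = collect_traces()
    return rank_components(faulty_trace, healthy_trace)


if __name__ == "__main__":
    faulty = {"A": [10, 12], "B": [50, 60], "C": [5, 5]}
    healthy = {"A": [10, 11], "B": [10, 10], "C": [5, 5]}
    print(diagnose(True, False, lambda: (faulty, healthy)))
```

In this sketch tracing is only collected once both cheaper checks have failed, which mirrors the abstract's argument that the expensive stage should be paid for only when an error has been validated.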