Ensuring high availability, adequate performance, and proper operation of enterprise software systems requires continuous monitoring. Today, most systems operate with minimal monitoring, typically based on service-level objectives (SLOs). Detailed metric-based monitoring is often too costly to use in production, while tracing is prohibitively expensive. Configuring monitoring when problems occur is a manual process.

In this paper we propose an alternative: minimal monitoring with SLOs is used to detect errors. When an error is detected, detailed monitoring is automatically enabled to validate the error using invariant-correlation models. If the error is validated, Application Response Measurement (ARM) tracing is dynamically activated on the faulty subsystem and on a healthy peer to perform differential trace-data analysis and diagnosis.

Based on fault-injection experiments, we show that our system is effective: it correctly detected and validated errors caused by 14 out of 15 injected faults. Differential analysis of the trace data collected for 210 seconds allowed us to top-rank the faulty component in 80% of the cases. In the remaining cases, the faulty component was ranked within the top 7 out of 81 components. We also demonstrate that the overhead of our system is low: given a false-positive rate of one per hour, the overhead is less than 2.5%.
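The escalation described above (SLO check, then validation, then tracing with differential analysis) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the form of the invariant-correlation models or the ranking metric. The linear model in `invariant_broken`, the mean-latency difference in `rank_components`, and all names below are assumptions made purely for illustration.

```python
from enum import Enum
from statistics import mean


class Level(Enum):
    MINIMAL = 1   # SLO checks only
    DETAILED = 2  # metric collection to validate a suspected error
    TRACING = 3   # tracing on the faulty subsystem and a healthy peer


def slo_violated(response_time_ms, slo_ms):
    """Cheap detection step: flag a response time above the objective."""
    return response_time_ms > slo_ms


def invariant_broken(x, y, slope, intercept, tolerance):
    """Validation step (assumed form): two metrics that normally satisfy
    y ~ slope * x + intercept deviate by more than the tolerance."""
    return abs(y - (slope * x + intercept)) > tolerance


def next_level(level, slo_bad, invariants_bad):
    """Escalate one step at a time; fall back to minimal monitoring
    when detailed monitoring does not confirm the error."""
    if level is Level.MINIMAL:
        return Level.DETAILED if slo_bad else Level.MINIMAL
    if level is Level.DETAILED:
        return Level.TRACING if invariants_bad else Level.MINIMAL
    return level


def rank_components(faulty_trace, healthy_trace):
    """Differential analysis (assumed metric): rank components by how far
    their mean latency on the faulty subsystem exceeds the healthy peer."""
    scores = {}
    for name, samples in faulty_trace.items():
        baseline = healthy_trace.get(name)
        if not baseline:
            continue  # no peer data to compare against
        scores[name] = mean(samples) - mean(baseline)
    return sorted(scores, key=scores.get, reverse=True)
```

In this sketch a false positive costs only one round of detailed monitoring before the system returns to `Level.MINIMAL`, which is the property the low-overhead claim relies on; the expensive tracing level is reached only after validation.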