There is widespread interest today in developing tools that can diagnose the cause of a system failure accurately and efficiently based on monitoring data collected from the system. Over time, the system monitoring data will contain two types of failure data: (i) annotated failure data L, which is monitoring data collected from failure states of the system for which the cause of failure has been diagnosed and attached to the data as an annotation; and (ii) unannotated failure data U, for which no diagnosis has been recorded. Previous work on wholly or partially automated diagnosis focused on L or U in isolation. In this paper, we argue that it is important to consider L and U together to improve the overall accuracy of diagnosis and, in particular, to proactively move instances from U to L. However, such movement requires manual diagnosis effort from system administrators. Since manual diagnosis is expensive and time-consuming, we propose an algorithm that makes the best use of manual effort while maximizing the benefit gained from newly diagnosed instances. We report an experimental evaluation of our algorithm using data from a variety of failures (both single failures and multiple correlated failures) injected in a testbed, as well as synthetic data.
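The abstract does not spell out the selection algorithm, so the following is only an illustrative sketch of the general idea of spending a limited manual-diagnosis budget on the instances in U that are likely to be most informative. It assumes a generic margin-based uncertainty-sampling heuristic over a nearest-centroid model, which is a stand-in technique and not the paper's method. The function names, the metric vectors, and the failure causes are all hypothetical.

```python
import math
from collections import defaultdict


def centroids(labeled):
    """Mean metric vector per diagnosed cause in the annotated set L."""
    groups = defaultdict(list)
    for vector, cause in labeled:
        groups[cause].append(vector)
    return {
        cause: [sum(col) / len(vectors) for col in zip(*vectors)]
        for cause, vectors in groups.items()
    }


def margin(vector, cents):
    """Gap between the two closest centroids; a small gap means an ambiguous instance."""
    dists = sorted(math.dist(vector, c) for c in cents.values())
    return dists[1] - dists[0] if len(dists) > 1 else 0.0


def select_for_diagnosis(labeled, unlabeled, budget):
    """Return up to `budget` indices into U, most ambiguous instances first."""
    cents = centroids(labeled)
    ranked = sorted(range(len(unlabeled)), key=lambda i: margin(unlabeled[i], cents))
    return ranked[:budget]


# Toy data invented for illustration: [cpu_util, io_wait] vectors.
L = [([0.9, 0.1], "cpu_hog"), ([0.8, 0.2], "cpu_hog"),
     ([0.1, 0.9], "disk_fault"), ([0.2, 0.8], "disk_fault")]
U = [[0.85, 0.15], [0.5, 0.5], [0.15, 0.85], [0.55, 0.45]]

# Indices of the U instances an administrator would be asked to diagnose.
chosen = select_for_diagnosis(L, U, budget=2)
```

In this toy setup the two instances lying between the known failure clusters are chosen, since diagnosing them would tell the model more than diagnosing instances that already sit close to a known cause. Once an administrator annotates a chosen instance, it would move from U to L and the centroids would be recomputed before the next round.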