We present a decision tree learning approach to diagnosing failures in large Internet sites. We record runtime properties of each request and apply automated machine learning and data mining techniques to identify the causes of failures. We train decision trees on request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components. We evaluate this approach using actual failures from eBay and find that, among hundreds of potential causes, the algorithm successfully identifies 13 out of 14 true causes of failure, along with 2 false positives. We also discuss results from applying simplified decision trees on eBay's production site over several months. In addition, we give a cost-benefit analysis of manual vs. automated diagnosis systems. Our contributions include the statistical learning approach, the adaptation of decision trees to the context of failure diagnosis, and the deployment and evaluation of our tools on a high-volume production service.
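The core idea of training a tree on request traces and ranking its failure paths can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain ID3-style tree with information-gain splits on categorical attributes, ranks failure-predicting leaves by their failed-request count (one possible reading of "degree of correlation with failure"), and omits the merging of nodes by component order. The attribute names (host, version) and the synthetic traces are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of 0/1 labels (1 = failed request)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_feature(rows, labels, features):
    """Return the attribute with the highest information gain, or None."""
    base = entropy(labels)
    best_gain, best_f = 0.0, None
    for f in features:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[f], []).append(label)
        remainder = sum(len(g) * entropy(g) for g in groups.values()) / len(labels)
        gain = base - remainder
        if gain > best_gain + 1e-12:  # ignore splits that gain nothing
            best_gain, best_f = gain, f
    return best_f

def leaf_paths(rows, labels, features, path=()):
    """Grow the tree recursively; return (path, n_failed, n_total) per leaf."""
    f = best_feature(rows, labels, features)
    if f is None:  # pure node, or no attribute separates the labels further
        return [(path, sum(labels), len(labels))]
    leaves = []
    for value in sorted({row[f] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[f] == value]
        leaves += leaf_paths([rows[i] for i in idx], [labels[i] for i in idx],
                             features, path + ((f, value),))
    return leaves

def rank_failure_paths(leaves):
    """Keep leaves where most requests failed; rank by failed-request count."""
    failing = [leaf for leaf in leaves if leaf[1] > leaf[2] / 2]
    return sorted(failing, key=lambda leaf: leaf[1], reverse=True)

# Hypothetical traces: 10 requests per (host, version) pair; only requests
# served by host h3 running version v2 fail.
rows, labels = [], []
for host in ("h1", "h2", "h3"):
    for version in ("v1", "v2"):
        for _ in range(10):
            rows.append({"host": host, "version": version})
            labels.append(1 if (host, version) == ("h3", "v2") else 0)

ranked = rank_failure_paths(leaf_paths(rows, labels, ["host", "version"]))
for path, n_failed, n_total in ranked:
    print(" AND ".join(f"{f}={v}" for f, v in path), f"{n_failed}/{n_total} failed")
```

Each ranked path reads as a conjunction of request attributes, which is what makes a tree convenient for diagnosis: an operator can inspect the suspected combination directly instead of interpreting an opaque classifier score.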