Implementing Prato, a database on demand service
HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
The FOREVER service for fault/intrusion removal
Proceedings of the 2nd workshop on Recent advances on intrusiton-tolerant systems
IBM Journal of Research and Development
Software Engineering for Self-Adaptive Systems
Barricade: defending systems against operator mistakes
Proceedings of the 5th European conference on Computer systems
Is collaborative QoS the solution to the SOA dependability dilemma?
Architecting dependable systems VII
Towards IT systems capable of managing their health
FOCS'10 Proceedings of the 16th Monterey conference on Foundations of computer software: modeling, development, and verification of adaptive systems
Case-based reasoning for autonomous service failure diagnosis and remediation in software systems
ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning
Automatic undo for cloud management via AI planning
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems
International Journal of Adaptive, Resilient and Autonomic Systems
Supporting undoability in systems operations
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Hi-index | 0.00 |
Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because of the challenge of accurate fault diagnosis in the presence of low coverage, poor localization ability, and false positives that are inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. In this paper, we present two recovery algorithms with complementary properties and trade-offs, and validate our algorithms (through simulation) by fault injection on a realistic e-commerce system.