Automatic Model-Driven Recovery in Distributed Systems

Authors:
Kaustubh R. Joshi;William H. Sanders;Matti A. Hiltunen;Richard D. Schlichting
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;AT&T Labs Research;AT&T Labs Research
Venue:
SRDS '05 Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems
Year:
2005

Citing 0
Cited 11

Implementing Prato, a database on demand service

HotAC II Hot Topics in Autonomic Computing
The FOREVER service for fault/intrusion removal

Proceedings of the 2nd workshop on Recent advances on intrusion-tolerant systems
Galapagos: model-driven discovery of end-to-end application-storage relationships in distributed systems

IBM Journal of Research and Development
Using Filtered Cartesian Flattening and Microrebooting to Build Enterprise Applications with Self-adaptive Healing

Software Engineering for Self-Adaptive Systems
Barricade: defending systems against operator mistakes

Proceedings of the 5th European conference on Computer systems
Is collaborative QoS the solution to the SOA dependability dilemma?

Architecting dependable systems VII
Towards IT systems capable of managing their health

FOCS'10 Proceedings of the 16th Monterey conference on Foundations of computer software: modeling, development, and verification of adaptive systems
Case-based reasoning for autonomous service failure diagnosis and remediation in software systems

ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning
Automatic undo for cloud management via AI planning

HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems

International Journal of Adaptive, Resilient and Autonomic Systems
Supporting undoability in systems operations

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration

Abstract

Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because of the challenge of accurate fault diagnosis in the presence of low coverage, poor localization ability, and false positives that are inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques, including Bayesian estimation and Markov decision theory, to provide controllers that choose good, if not optimal, recovery actions according to user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. We present two recovery algorithms with complementary properties and trade-offs, and validate them (through simulation) by fault injection on a realistic e-commerce system.
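
To make the combination of Bayesian estimation and cost-based action choice more concrete, the following is a minimal sketch and not the paper's algorithms. It assumes a single monitor with imperfect coverage and a nonzero false-positive rate, updates a belief over fault hypotheses after an alarm, and then picks the action with the lowest one-step expected cost (a myopic stand-in for a full Markov decision process). All hypothesis names, probabilities, and costs are hypothetical values chosen for illustration.

```python
from typing import Dict


def bayes_update(prior: Dict[str, float],
                 likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior over fault hypotheses: P(h | obs) is proportional to P(obs | h) * P(h)."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / total for h, p in unnorm.items()}


def best_action(belief: Dict[str, float],
                cost: Dict[str, Dict[str, float]]) -> str:
    """Pick the action with the lowest expected cost under the current belief.

    This is a one-step (myopic) choice, not a full MDP solution.
    """
    def expected(action: str) -> float:
        return sum(belief[h] * cost[action][h] for h in belief)
    return min(cost, key=expected)


# Hypothetical prior belief over what is wrong with the system.
prior = {"none": 0.90, "web_crash": 0.05, "db_crash": 0.05}

# Hypothetical P(monitor raises an alarm | hypothesis).
# 0.02 models false positives; 0.60 models limited coverage of DB faults.
p_alarm = {"none": 0.02, "web_crash": 0.95, "db_crash": 0.60}

# Hypothetical cost of each action under each true hypothesis
# (for example, seconds of downtime incurred).
cost = {
    "restart_web": {"none": 5, "web_crash": 5, "db_crash": 65},
    "restart_db": {"none": 20, "web_crash": 80, "db_crash": 20},
    "do_nothing": {"none": 0, "web_crash": 100, "db_crash": 100},
}

posterior = bayes_update(prior, p_alarm)   # belief after observing an alarm
most_likely = max(posterior, key=posterior.get)
action = best_action(posterior, cost)
```

With these assumed numbers, a single alarm moves most of the belief away from "none" but still leaves real uncertainty between the two fault hypotheses, which is why the choice is driven by expected cost rather than by the most likely hypothesis alone.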