A planning-based approach to failure recovery in distributed systems

Authors:
Alexander L. Wolf;Naveed Arshad
Affiliations:
University of Colorado at Boulder;University of Colorado at Boulder
Venue:
A planning-based approach to failure recovery in distributed systems
Year:
2006

Citing 0
Cited 3

A framework for automated fault recovery planning in large-scale virtualized infrastructures

MACE'10 Proceedings of the 5th IEEE international conference on Modelling autonomic communication environments
Survey: Survey of fault tolerant techniques for grid

Computer Science Review
Runtime verification of service-oriented systems: a well-rounded survey

International Journal of Web and Grid Services

Quantified Score

Hi-index	0.00

Abstract

Automated failure recovery in distributed systems is challenging because of the many requirements and dependencies among their components. Moreover, failure scenarios are usually unpredictable, so it is not practical to enumerate all possible failure scenarios along with a recovery procedure for each. For this reason, present failure recovery techniques are highly manual and incur considerable downtime. In this dissertation, we have developed a planning-based approach to automated failure recovery in distributed component-based systems. The approach relies on continuous monitoring of the system, so that an exact system state is always available to a failure monitor. When a failure is detected, the monitor performs various checks to ensure that it is not a false positive or false negative. A dependency analyzer then determines the effects of the failure on other parts of the system. After this, an offline planning procedure is performed to take the system from a failed state to a working state. This planning is performed using an artificial intelligence (AI) planner. By using planning, the approach can recover from a variety of failed states and reach any of several acceptable states, from minimal functionality to complete recovery. Once a plan is computed, it is executed on the system to bring it back to a working state. We have evaluated this technique through various online and synthetic experiments performed on several distributed applications. Our results show that this is an effective technique for automatically recovering component-based distributed systems from failures, and that it scales to large distributed systems.
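
The dependency-analysis and planning steps described in the abstract can be illustrated with a small sketch. This is not the dissertation's implementation: the abstract does not name the planner or its domain model, so the example below substitutes a plain breadth-first state-space search for a real AI planner, and the component names (`db`, `app`, `web`), the `DEPENDS_ON` table, and all function names are hypothetical.

```python
from collections import deque

# Hypothetical dependency table: component -> components it requires.
DEPENDS_ON = {
    "db": [],
    "app": ["db"],
    "web": ["app"],
}


def affected_by(failed, depends_on=DEPENDS_ON):
    """Return the failed component plus everything that transitively depends on it."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for comp, deps in depends_on.items():
            if comp not in affected and any(d in affected for d in deps):
                affected.add(comp)
                changed = True
    return affected


def applicable_actions(running, depends_on=DEPENDS_ON):
    """Yield (action, next_state): a component may start only once its dependencies run."""
    for comp, deps in depends_on.items():
        if comp not in running and all(d in running for d in deps):
            yield f"start {comp}", running | {comp}


def plan_recovery(running, goal, depends_on=DEPENDS_ON):
    """Breadth-first search (a stand-in for an AI planner) from the current
    state to any state in which every goal component is running.
    Returns a list of actions, or None if the goal is unreachable."""
    start = frozenset(running)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action, nxt in applicable_actions(state, depends_on):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [action]))
    return None


if __name__ == "__main__":
    down = affected_by("db")                 # failure of db takes app and web down too
    running = set(DEPENDS_ON) - down         # what is still running after the failure
    print(plan_recovery(running, {"db"}))    # minimal-functionality goal
    print(plan_recovery(running, set(DEPENDS_ON)))  # complete-recovery goal
```

Passing different goal sets to `plan_recovery` mirrors the abstract's point that one planning mechanism can target several acceptable states, from minimal functionality to complete recovery. A production system would replace the search with a real planner and a richer action model (stop, restart, reconfigure, migrate).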