A planning-based approach to failure recovery in distributed systems

  • Authors:
  • Alexander L. Wolf;Naveed Arshad

  • Affiliations:
  • University of Colorado at Boulder;University of Colorado at Boulder

  • Venue:
  • A planning-based approach to failure recovery in distributed systems
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Automated failure recovery in distributed systems poses a tough challenge because of myriad requirements and dependencies among its components. Moreover, failure scenarios are usually unpredictable so they cannot easily be foreseen. Therefore, it is not practical to enumerate all possible failure scenarios and a way to recover a distributed system for each of them. Due to this reason, present failure recovery techniques are highly manual and have considerable downtime associated with them. In this dissertation, we have developed a planning-based approach to automated failure recovery in distributed component-based systems. This approach automates failure recovery through continuous monitoring of the system. Therefore, an exact system state is always available with a failure monitor. When a failure is detected the monitor performs various checks to ensure that it is not a false positive or false negative. A dependency analyzer then checks effects of the failure on other parts of the system. After this an offline planning procedure is performed to take the system from a failed state to a working state. This planning is performed using an artificially intelligent (AI) planner. By using planning, this approach can be used to recover from a variety of failed states and reach any of several acceptable states: from minimal functionality to complete recovery. When a plan is calculated, it is executed onto the system to bring it back to a working state. We have evaluated this technique through various online and synthetic experiments performed on various distributed applications. Our results have shown that this is indeed an effective technique to automatically recover component-based distributed systems from a failure. Our results have also shown that this technique can also scale to large-scale distributed systems.