A Reinforcement Learning Approach to Automatic Error Recovery

Authors:
Qijun Zhu;Chun Yuan
Affiliations:
Tianjin University, China;Microsoft Research Asia, China
Venue:
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Year:
2007

Citing 0
Cited 2

On-line adaptive algorithms in autonomic restart control

ATC'10 Proceedings of the 7th international conference on Autonomic and trusted computing
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review

Quantified Score

Hi-index	0.00

Visualization

Abstract

The increasing complexity of modern computer systems makes fault detection and localization prohibitively expensive, and therefore fast recovery from failures is becoming more and more important. A significant fraction of failures can be cured by executing specific repair actions, e.g. rebooting, even when the exact root causes are unknown. However, designing reasonable recovery policies to effectively schedule potential repair actions could be difficult and error prone. In this paper, we present a novel approach to automate recovery policy generation with Reinforcement Learning techniques. Based on the recovery history of the original user-defined policy, our method can learn a new, locally optimal policy that outperforms the original one. In our experimental work on data from a real cluster environment, we found that the automatically generated policy can save 10% of machine downtime.