Reducing Recovery Time in a Small Recursively Restartable System

Authors:
George Candea;James Cutler;Armando Fox;Rushabh Doshi;Priyank Garg;Rakesh Gowda
Venue:
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Year:
2002


Abstract

We present ideas on how to structure software systems for high availability by considering MTTR/MTTF characteristics of components in addition to the traditional criteria, such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to recover from transient failures and rejuvenate software components. Here we refine the original proposal and apply the RR philosophy to Mercury, a COTS-based satellite ground station that has been in operation for over 2 years. We develop three techniques for transforming component group boundaries such that time-to-recover is reduced, hence increasing system availability. We also further RR by defining the notions of an oracle, restart group and restart policy, while showing how to reason about system properties in terms of restart groups. From our experience with applying RR to Mercury, we draw design guidelines and lessons for the systematic application of recursive restartability to other software systems amenable to RR.
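
The link the abstract draws between a shorter time-to-recover and higher availability can be illustrated with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is a minimal illustration and is not taken from the paper: the function name and the numbers are hypothetical, chosen only to show the direction of the effect.

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: the expected fraction of time a component is up.

    mttf: mean time to failure, mttr: mean time to recover (same time unit).
    """
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


# Hypothetical figures, not measurements from Mercury:
# the same failure rate, but recovery shortened by restarting a smaller group.
MTTF_MINUTES = 1000.0
before = availability(MTTF_MINUTES, mttr=30.0)  # full restart
after = availability(MTTF_MINUTES, mttr=6.0)    # partial restart

print(f"availability with full restart:    {before:.4f}")
print(f"availability with partial restart: {after:.4f}")
```

With MTTF held fixed, any reduction in MTTR raises the ratio, which is why regrouping components to shorten restarts can improve availability without making failures any rarer.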