ROC-1: Hardware Support for Recovery-Oriented Computing
IEEE Transactions on Computers - Special issue on fault-tolerant embedded systems
Improving cluster availability using workstation validation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Portability, Extensibility and Robustness in iROS
PERCOM '03 Proceedings of the First IEEE International Conference on Pervasive Computing and Communications
Improving the reliability of commodity operating systems
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Automatic detection and repair of errors in data structures
OOPSLA '03 Proceedings of the 18th annual ACM SIGPLAN conference on Object-oriented programing, systems, languages, and applications
Acceptability-oriented computing
OOPSLA '03 Companion of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Acceptability-oriented computing
ACM SIGPLAN Notices
Improving availability with recursive microreboots: a soft-state system case study
Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Self-Healing in Modern Operating Systems
Queue - Programming Languages
Improving the reliability of commodity operating systems
ACM Transactions on Computer Systems (TOCS)
Destructive Transaction: Human-Oriented Cluster System Management Mechanism
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Data structure repair using goal-directed reasoning
Proceedings of the 27th international conference on Software engineering
An online evolutionary approach to developing internet services
EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Inference and enforcement of data structure consistency specifications
Proceedings of the 2006 international symposium on Software testing and analysis
Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications
ACM Transactions on Computer Systems (TOCS)
Goal-Directed Reasoning for Specification-Based Data Structure Repair
IEEE Transactions on Software Engineering
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Using runtime paths for macroanalysis
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Treating bugs as allergies: a safe method for surviving software failures
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Solaris service management facility: modern system startup and administration
LISA '05 Proceedings of the 19th conference on Large Installation System Administration Conference - Volume 19
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Unmodified device driver reuse and improved system dependability via virtual machines
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Toward recovery-oriented computing
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Bristlecone: A Language for Robust Software Systems
ECOOP '08 Proceedings of the 22nd European conference on Object-Oriented Programming
Automatically patching errors in deployed software
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
"Otherworld": giving applications a chance to survive OS kernel crashes
HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Gestalt: integrated support for implementation and analysis in machine learning
UIST '10 Proceedings of the 23nd annual ACM symposium on User interface software and technology
Lowering the barrier to applying machine learning
UIST '10 Adjunct proceedings of the 23nd annual ACM symposium on User interface software and technology
Recovery tasks: an automated approach to failure recovery
RV'10 Proceedings of the First international conference on Runtime verification
Exception handling in the choices operating system
Advanced Topics in Exception Handling Techniques
Modeling and cost analysis of nested software rejuvenation policy
ICNC'05 Proceedings of the First international conference on Advances in Natural Computation - Volume Part III
A dynamic mechanism for recovering from buffer overflow attacks
ISC'05 Proceedings of the 8th international conference on Information Security
Can dynamic provisioning and rejuvenation systems coexist in peace?
DSOM'05 Proceedings of the 16th IFIP/IEEE Ambient Networks international conference on Distributed Systems: operations and Management
Towards service awareness and autonomic features in a SIP-Enabled network
WAC'05 Proceedings of the Second international IFIP conference on Autonomic Communication
Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
International Journal of Grid and Utility Computing
Hi-index | 0.01 |
Abstract: Even after decades of software engineering research, complex computer systems still fail, primarily due to nondeterministic bugs that are typically resolved by rebooting. Conceding that Heisenbugs will remain a fact of life, we propose a systematic investigation of restarts as "high availability medicine." In this paper we show how recursive restartability (RR) - the ability of a system to gracefully tolerate restarts at multiple levels - improves fault tolerance, reduces time-to-repair, and enables system designers to build flexible, highly available software infrastructures. Using several examples of widely deployed software systems, we identify properties that are required of RR systems and outline an agenda for turning the recursive restartability philosophy into a practical software structuring tool. Finally, we describe infrastructural support for RR systems, along with initial ideas on how to analyze and benchmark such systems.