Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover.

All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is recovered first; if that does not work, progressively larger subsets are recovered. Our domain of interest is Internet services; these systems experience primarily transient or intermittent failures that can typically be resolved by rebooting. Conceding that failure-free software will continue eluding us for years to come, we undertake a systematic investigation of fine-grained component-level restarts, or microreboots, as high-availability medicine. Building and maintaining an accurate model of large Internet systems is nearly impossible, due to their scale and constantly evolving nature, so we take an application-generic approach that relies on empirical observations to manage recovery.

We apply recursive microreboots to Mercury, a commercial off-the-shelf (COTS)-based satellite ground station that is based on an Internet service platform. Mercury has been in successful operation for over 3 years. From our experience with Mercury, we draw design guidelines and lessons for the application of recursive microreboots to other software systems. We also present a set of guidelines for building systems amenable to recursive reboots, known as "crash-only software systems."
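The recursive recovery policy described above (restart the smallest subset of components first, then progressively larger subsets if the failure persists) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Node` restart-tree class, the `recursive_recover` function, and the `restart` and `is_healthy` callbacks are all hypothetical names introduced here, and the sketch assumes components are arranged in a tree where restarting a node restarts everything beneath it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical restart group: restarting a node restarts all leaves below it."""
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        """Attach a child restart group and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def components(self) -> List[str]:
        """Names of all leaf components covered by this restart group."""
        if not self.children:
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.components())
        return names


def recursive_recover(
    failed: Node,
    restart: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
) -> List[List[str]]:
    """Restart the smallest group containing the failure; widen on each miss.

    `restart` and `is_healthy` are assumed to be supplied by the caller
    (for example, a process manager and an end-to-end health probe).
    Returns the list of groups that were restarted, in order.
    """
    attempts: List[List[str]] = []
    node: Optional[Node] = failed
    while node is not None:
        group = node.components()
        restart(group)            # microreboot this subset
        attempts.append(group)
        if is_healthy():          # empirical check, no system model needed
            return attempts
        node = node.parent        # recovery failed: escalate to a larger subset
    raise RuntimeError("failure persists after restarting the whole system")
```

In this sketch the health check is purely observational, which mirrors the application-generic stance of the text: the policy does not need to know why a restart failed, only that the symptom persists, before it widens the restart boundary.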