We explore the abstraction of failure transparency, in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and application failures, and must do so without help from the programmer and without unduly slowing failure-free performance. We describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery impossible in these cases.