Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Data Diversity: An Approach to Software Fault Tolerance
IEEE Transactions on Computers - Fault-Tolerant Computing
ACM Transactions on Computer Systems (TOCS)
Recovery in distributed systems using asynchronous message logging and checkpointing
PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and check-pointing
Journal of Algorithms
Real-time, concurrent checkpoint for parallel programs
PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
Simulating reactive systems by deduction
ACM Transactions on Software Engineering and Methodology (TOSEM)
Selected papers of the first conference on World-Wide Web
Hypervisor-based fault tolerance
ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
Replay for concurrent non-deterministic shared-memory applications
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Trade-offs in implementing causal message logging protocols
PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Eraser: a dynamic data race detector for multithreaded programs
ACM Transactions on Computer Systems (TOCS)
Free transactions with Rio Vista
Proceedings of the sixteenth ACM symposium on Operating systems principles
IEEE Transactions on Parallel and Distributed Systems
Practical Byzantine fault tolerance
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Fast cluster failover using virtual memory-mapped communication
ICS '99 Proceedings of the 13th international conference on Supercomputing
httperf—a tool for measuring web server performance
ACM SIGMETRICS Performance Evaluation Review
Deciding when to forget in the Elephant file system
Proceedings of the seventeenth ACM symposium on Operating systems principles
A static analyzer for finding dynamic programming errors
Software—Practice & Experience
Reliability Issues in Computing System Design
ACM Computing Surveys (CSUR)
BASE: using abstraction to improve fault tolerance
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
CLIP: a checkpointing tool for message-passing parallel programs
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
A system and language for building system-specific, static analyses
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Building Secure and Reliable Network Applications
Building Secure and Reliable Network Applications
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
How to Own the Internet in Your Spare Time
Proceedings of the 11th USENIX Security Symposium
Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Data Replication Strategies for Fault Tolerance and Availability on Commodity Clusters
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Reducing Recovery Time in a Small Recursively Restartable System
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
A practical flow-sensitive and context-sensitive C and C++ memory leak detector
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
A message system supporting fault tolerance
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
The Impact of Recovery Mechanisms on the Likelihood of Saving Corrupted State
ISSRE '02 Proceedings of the 13th International Symposium on Software Reliability Engineering
ReEnact: using thread-level speculation mechanisms to debug data races in multithreaded codes
Proceedings of the 30th annual international symposium on Computer architecture
Software Rejuvenation: Analysis, Module and Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checkpointing and Its Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Remote Repair of Operating System State Using Backdoors
ICAC '04 Proceedings of the First International Conference on Autonomic Computing
DieHard: probabilistic memory safety for unsafe languages
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Have things changed now?: an empirical study of bug characteristics in modern open source software
Proceedings of the 1st workshop on Architectural and system support for improving software dependability
Building a reactive immune system for software services
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Proactive recovery in a Byzantine-fault-tolerant system
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Exploring failure transparency and the limits of generic recovery
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Scalability of the microsoft cluster service
WINSYM'98 Proceedings of the 2nd conference on USENIX Windows NT Symposium - Volume 2
StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks
SSYM'98 Proceedings of the 7th conference on USENIX Security Symposium - Volume 7
The N-Version Approach to Fault-Tolerant Software
IEEE Transactions on Software Engineering
SigRace: signature-based data race detection
Proceedings of the 36th annual international symposium on Computer architecture
Surviving sensor network software faults
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Tamper-Tolerant Software: Modeling and Implementation
IWSEC '09 Proceedings of the 4th International Workshop on Security: Advances in Information and Computer Security
Predicting and preventing inconsistencies in deployed distributed systems
ACM Transactions on Computer Systems (TOCS)
Statistically regulating program behavior via mainstream computing
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Lightweight, high-resolution monitoring for troubleshooting production systems
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Gadara: dynamic deadlock avoidance for multithreaded programs
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Deadlock immunity: enabling systems to defend against deadlocks
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Automatic workarounds for web applications
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
Using allopoietic agents in replicated software to respond to errors, faults, and attacks
Proceedings of the 48th Annual Southeast Regional Conference
Bypassing races in live applications with execution filters
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
A type and effect system for deadlock avoidance in low-level languages
Proceedings of the 7th ACM SIGPLAN workshop on Types in language design and implementation
Journal of Systems Architecture: the EUROMICRO Journal
Towards dependable clients: improving the reliability and availability of the browsers
Proceedings of the 9th Middleware Doctoral Symposium of the 13th ACM/IFIP/USENIX International Middleware Conference
Fmeter: extracting indexable low-level system signatures by counting kernel function calls
Proceedings of the 13th International Middleware Conference
Automatic recovery from runtime failures
Proceedings of the 2013 International Conference on Software Engineering
A framework for self-healing software systems
Proceedings of the 2013 International Conference on Software Engineering
Preventing database deadlocks in applications
Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering
Hi-index | 0.00 |
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time. This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and nondeterministic. Our idea, inspired from allergy treatment in real life, is to rollback the program to a recent checkpoint upon a software failure, and then to reexecute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the “allergen” from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis. We have implemented Rx on Linux. Our experiments with five server applications that contain seven bugs of various types show that Rx can survive six out of seven software failures and provide transparent fast recovery within 0.017--0.16 seconds, 21--53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and reexecution without environmental changes, cannot successfully recover the four servers (Squid, Apache, CVS, and ypserv) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a nondeterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.