ACM Transactions on Computer Systems (TOCS)
Blueprints for high availability: designing resilient distributed systems
Blueprints for high availability: designing resilient distributed systems
Reliability Issues in Computing System Design
ACM Computing Surveys (CSUR)
Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Reducing Recovery Time in a Small Recursively Restartable System
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Software Rejuvenation: Analysis, Module and Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel
HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
A Dynamic Technique for Eliminating Buffer Overflow Vulnerabilities (and Other Memory Errors)
ACSAC '04 Proceedings of the 20th Annual Computer Security Applications Conference
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Building a reactive immune system for software services
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Scalability of the microsoft cluster service
WINSYM'98 Proceedings of the 2nd conference on USENIX Windows NT Symposium - Volume 2
StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks
SSYM'98 Proceedings of the 7th conference on USENIX Security Symposium - Volume 7
Leveraging SDN layering to systematically troubleshoot networks
Proceedings of the second ACM SIGCOMM workshop on Hot topics in software defined networking
Hi-index | 0.01 |
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely speculating on program execution, and requiring a long recovery time. This paper proposes an innovative, safe technique, called Rx, that can quickly recover programs from many types of common software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to rollback the program to a recent checkpoint upon a software failure, and then to reexecute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.