Rx: Treating bugs as allergies—a safe method to survive software failures

Authors:
Feng Qin; Joseph Tucek; Yuanyuan Zhou; Jagadeesan Sundaresan
Affiliations:
The Ohio State University, Columbus, OH; University of Illinois at Urbana-Champaign, Urbana, IL; University of Illinois at Urbana-Champaign, Urbana, IL; Citadel Investment Group, Chicago, IL
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2007

References (works this paper cites): 54
Cited by: 19

Optimistic recovery in distributed systems

ACM Transactions on Computer Systems (TOCS)
Data Diversity: An Approach to Software Fault Tolerance

IEEE Transactions on Computers - Fault-Tolerant Computing
Fault tolerance under UNIX

ACM Transactions on Computer Systems (TOCS)
Recovery in distributed systems using asynchronous message logging and checkpointing

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and check-pointing

Journal of Algorithms
Real-time, concurrent checkpoint for parallel programs

PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
Simulating reactive systems by deduction

ACM Transactions on Software Engineering and Methodology (TOSEM)
World-Wide Web proxies

Selected papers of the first conference on World-Wide Web
Hypervisor-based fault tolerance

ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
Replay for concurrent non-deterministic shared-memory applications

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Trade-offs in implementing causal message logging protocols

PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Eraser: a dynamic data race detector for multithreaded programs

ACM Transactions on Computer Systems (TOCS)
Free transactions with Rio Vista

Proceedings of the sixteenth ACM symposium on Operating systems principles
Diskless Checkpointing

IEEE Transactions on Parallel and Distributed Systems
Practical Byzantine fault tolerance

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Fast cluster failover using virtual memory-mapped communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
httperf—a tool for measuring web server performance

ACM SIGMETRICS Performance Evaluation Review
Deciding when to forget in the Elephant file system

Proceedings of the seventeenth ACM symposium on Operating systems principles
A static analyzer for finding dynamic programming errors

Software—Practice & Experience
Reliability Issues in Computing System Design

ACM Computing Surveys (CSUR)
BASE: using abstraction to improve fault tolerance

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
CLIP: a checkpointing tool for message-passing parallel programs

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
A system and language for building system-specific, static analyses

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Building Secure and Reliable Network Applications

Building Secure and Reliable Network Applications
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
How to Own the Internet in Your Spare Time

Proceedings of the 11th USENIX Security Symposium
Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software

DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Data Replication Strategies for Fault Tolerance and Availability on Commodity Clusters

DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Reducing Recovery Time in a Small Recursively Restartable System

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
A practical flow-sensitive and context-sensitive C and C++ memory leak detector

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
CCured in the real world

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
The Design and Architecture of the Microsoft Cluster Service - A Practical Approach to High-Availability and Scalability

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
A NonStop kernel

SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
A message system supporting fault tolerance

SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
The Impact of Recovery Mechanisms on the Likelihood of Saving Corrupted State

ISSRE '02 Proceedings of the 13th International Symposium on Software Reliability Engineering
ReEnact: using thread-level speculation mechanisms to debug data races in multithreaded codes

Proceedings of the 30th annual international symposium on Computer architecture
Software Rejuvenation: Analysis, Module and Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checkpointing and Its Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,

Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
ReVirt: enabling intrusion analysis through virtual-machine logging and replay

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Remote Repair of Operating System State Using Backdoors

ICAC '04 Proceedings of the First International Conference on Autonomic Computing
DieHard: probabilistic memory safety for unsafe languages

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Have things changed now?: an empirical study of bug characteristics in modern open source software

Proceedings of the 1st workshop on Architectural and system support for improving software dependability
Building a reactive immune system for software services

ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Proactive recovery in a Byzantine-fault-tolerant system

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Exploring failure transparency and the limits of generic recovery

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Recovering device drivers

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Microreboot — A technique for cheap recovery

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Scalability of the Microsoft cluster service

WINSYM'98 Proceedings of the 2nd conference on USENIX Windows NT Symposium - Volume 2
StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks

SSYM'98 Proceedings of the 7th conference on USENIX Security Symposium - Volume 7
The N-Version Approach to Fault-Tolerant Software

IEEE Transactions on Software Engineering

SigRace: signature-based data race detection

Proceedings of the 36th annual international symposium on Computer architecture
Surviving sensor network software faults

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Tamper-Tolerant Software: Modeling and Implementation

IWSEC '09 Proceedings of the 4th International Workshop on Security: Advances in Information and Computer Security
Predicting and preventing inconsistencies in deployed distributed systems

ACM Transactions on Computer Systems (TOCS)
Statistically regulating program behavior via mainstream computing

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Adaptive bug isolation

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Lightweight, high-resolution monitoring for troubleshooting production systems

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Gadara: dynamic deadlock avoidance for multithreaded programs

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Deadlock immunity: enabling systems to defend against deadlocks

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Automatic workarounds for web applications

Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
Using allopoietic agents in replicated software to respond to errors, faults, and attacks

Proceedings of the 48th Annual Southeast Regional Conference
Bypassing races in live applications with execution filters

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
A type and effect system for deadlock avoidance in low-level languages

Proceedings of the 7th ACM SIGPLAN workshop on Types in language design and implementation
Supporting component-based failover units in middleware for distributed real-time and embedded systems

Journal of Systems Architecture: the EUROMICRO Journal
Towards dependable clients: improving the reliability and availability of the browsers

Proceedings of the 9th Middleware Doctoral Symposium of the 13th ACM/IFIP/USENIX International Middleware Conference
Fmeter: extracting indexable low-level system signatures by counting kernel function calls

Proceedings of the 13th International Middleware Conference
Automatic recovery from runtime failures

Proceedings of the 2013 International Conference on Software Engineering
A framework for self-healing software systems

Proceedings of the 2013 International Conference on Software Engineering
Preventing database deadlocks in applications

Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering

Abstract

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: the need to restructure applications, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time. This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and nondeterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to reexecute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the “allergen” from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis. We have implemented Rx on Linux. Our experiments with five server applications that contain seven bugs of various types show that Rx can survive six out of seven software failures and provide transparent fast recovery within 0.017 to 0.16 seconds, 21 to 53 times faster than the whole-program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole-program restart approach and a simple rollback and reexecution without environmental changes, cannot successfully recover the four servers (Squid, Apache, CVS, and ypserv) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a nondeterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.
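
The rollback-and-reexecute idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (Rx works on whole processes under Linux); it is a minimal in-memory simulation under stated assumptions. The names `handle_request`, `SimulatedFailure`, `ENV_CHANGES`, and the `alloc_padding` knob are all hypothetical, chosen only to show the control flow: checkpoint, detect a failure, roll back, and retry under a changed environment.

```python
import copy


class SimulatedFailure(Exception):
    """Stands in for a failure that an error sensor would catch."""


def handle_request(state, request, env):
    """Toy handler with a deterministic bug: requests longer than the
    buffer fail unless the environment pads allocations."""
    state["started"] += 1  # partial update that a rollback must undo
    capacity = 8 + env.get("alloc_padding", 0)
    if len(request) > capacity:
        raise SimulatedFailure(f"overflow on {request!r}")
    state["handled"].append(request)


# Hypothetical environment changes, tried in order after a failure.
ENV_CHANGES = [
    {},                     # plain re-execution, no environmental change
    {"alloc_padding": 16},  # pad every allocation
]


def run_with_recovery(state, request):
    """Run one request; on failure, roll back and retry in modified environments."""
    checkpoint = copy.deepcopy(state)  # checkpoint before executing
    try:
        handle_request(state, request, {})
        return state, "normal"
    except SimulatedFailure:
        pass
    for env in ENV_CHANGES:
        state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            handle_request(state, request, env)
            return state, env  # report which change avoided the failure
        except SimulatedFailure:
            continue
    # No change helped: return the clean checkpointed state.
    return copy.deepcopy(checkpoint), "unrecovered"


def serve(requests):
    """Process requests in order, recording how each one completed."""
    state = {"started": 0, "handled": []}
    outcomes = []
    for request in requests:
        state, outcome = run_with_recovery(state, request)
        outcomes.append(outcome)
    return state, outcomes
```

In this sketch, plain re-execution fails again on the long request because the toy bug is deterministic, while the padded environment lets it complete. The recorded outcome plays the role of the diagnostic feedback the abstract mentions: it tells the programmer which environmental change avoided the failure.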