Rx: treating bugs as allergies - a safe method to survive software failures

Authors:
Feng Qin; Joseph Tucek; Jagadeesan Sundaresan; Yuanyuan Zhou
Affiliations:
University of Illinois at Urbana-Champaign (all authors)
Venue:
Proceedings of the twentieth ACM symposium on Operating systems principles
Year:
2005

Abstract

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.

This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

We have implemented Rx on Linux. Our experiments with four server applications that contain six bugs of various types show that Rx can survive all six software failures and provide transparent, fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole-program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole-program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.
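The rollback-and-re-execute idea in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the Rx implementation: the handler, the off-by-one "bug", the `alloc_padding` environment knob, and every function name below are hypothetical, and an in-memory deep copy stands in for a real process checkpoint.

```python
import copy


class SimulatedFailure(Exception):
    """Stands in for a crash that a runtime monitor would detect."""


def handle_request(state, request, env):
    """Toy handler with a deterministic off-by-one overflow (hypothetical bug)."""
    state["log"].append("start " + request)  # side effect made before the crash
    capacity = len(request) + env["alloc_padding"]
    if len(request) + 1 > capacity:  # one extra slot is needed but was not reserved
        raise SimulatedFailure("buffer overflow")
    state["served"].append(request)


# Assumed environment settings, for illustration only.
DEFAULT_ENV = {"alloc_padding": 0}
MODIFIED_ENVS = [{"alloc_padding": 8}]  # e.g. pad every allocation


def restore(state, checkpoint):
    """Roll the mutable state back to the checkpointed copy."""
    state.clear()
    state.update(copy.deepcopy(checkpoint))


def survive(state, request):
    """Run normally; on failure, roll back and re-execute in a modified environment."""
    checkpoint = copy.deepcopy(state)
    try:
        handle_request(state, request, DEFAULT_ENV)
        return DEFAULT_ENV
    except SimulatedFailure:
        pass
    for env in MODIFIED_ENVS:
        restore(state, checkpoint)
        try:
            handle_request(state, request, env)
            return env  # this change removed the "allergen"
        except SimulatedFailure:
            continue
    restore(state, checkpoint)
    raise RuntimeError("no environment change avoided the failure")


state = {"log": [], "served": []}
working_env = survive(state, "GET /")
print(working_env, state)
```

In this sketch, re-running the handler under `DEFAULT_ENV` would fail every time because the bug is deterministic, which mirrors why plain rollback and re-execution cannot help in that case. The environment that finally succeeds also hints at the kind of bug involved, in the spirit of the diagnostic feedback the abstract mentions.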