Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
ACM Transactions on Computer Systems (TOCS)
Recovery in distributed systems using asynchronous message logging and checkpointing
PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and check-pointing
Journal of Algorithms
Real-time, concurrent checkpoint for parallel programs
PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
Simulating reactive systems by deduction
ACM Transactions on Software Engineering and Methodology (TOSEM)
Hypervisor-based fault tolerance
ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
Replay for concurrent non-deterministic shared-memory applications
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Trade-offs in implementing causal message logging protocols
PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Free transactions with Rio Vista
Proceedings of the sixteenth ACM symposium on Operating systems principles
IEEE Transactions on Parallel and Distributed Systems
Practical Byzantine fault tolerance
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Fast cluster failover using virtual memory-mapped communication
ICS '99 Proceedings of the 13th international conference on Supercomputing
httperf—a tool for measuring web server performance
ACM SIGMETRICS Performance Evaluation Review
Deciding when to forget in the Elephant file system
Proceedings of the seventeenth ACM symposium on Operating systems principles
Blueprints for high availability: designing resilient distributed systems
Blueprints for high availability: designing resilient distributed systems
Reliability Issues in Computing System Design
ACM Computing Surveys (CSUR)
BASE: using abstraction to improve fault tolerance
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
CLIP: a checkpointing tool for message-passing parallel programs
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Building Secure and Reliable Network Applications
Building Secure and Reliable Network Applications
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
How to Own the Internet in Your Spare Time
Proceedings of the 11th USENIX Security Symposium
Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Data Replication Strategies for Fault Tolerance and Availability on Commodity Clusters
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Reducing Recovery Time in a Small Recursively Restartable System
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
A message system supporting fault tolerance
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
The Impact of Recovery Mechanisms on the Likelihood of Saving Corrupted State
ISSRE '02 Proceedings of the 13th International Symposium on Software Reliability Engineering
Software Rejuvenation: Analysis, Module and Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checkpointing and Its Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Remote Repair of Operating System State Using Backdoors
ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Building a reactive immune system for software services
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Proactive recovery in a Byzantine-fault-tolerant system
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Exploring failure transparency and the limits of generic recovery
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Scalability of the microsoft cluster service
WINSYM'98 Proceedings of the 2nd conference on USENIX Windows NT Symposium - Volume 2
StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks
SSYM'98 Proceedings of the 7th conference on USENIX Security Symposium - Volume 7
DieHard: probabilistic memory safety for unsafe languages
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Selective early request termination for busy internet services
Proceedings of the 15th international conference on World Wide Web
HeapMD: identifying heap-based bugs using anomaly detection
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
ExecRecorder: VM-based full-system replay for attack analysis and system recovery
Proceedings of the 1st workshop on Architectural and system support for improving software dependability
Dynamic slicing long running programs through execution fast forwarding
Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering
Speculative execution in a distributed file system
ACM Transactions on Computer Systems (TOCS)
Exterminator: automatically correcting memory errors with high probability
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Correlating multi-session attacks via replay
HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
Automatic on-line failure diagnosis at the end-user site
HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Sweeper: a lightweight end-to-end system for defending against fast worms
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Discrete control for safe execution of IT automation workflows
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Enabling tracing Of long-running multithreaded programs via dynamic execution reduction
Proceedings of the 2007 international symposium on Software testing and analysis
Efficient checkpointing of java software using context-sensitive capture and replay
Proceedings of the the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering
Bouncer: securing software by blocking bad input
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Triage: diagnosing production run failures at the user's site
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
AutoBash: improving configuration management with operating system causality analysis
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
An approach to detecting failures automatically
Fourth international workshop on Software quality assurance: in conjunction with the 6th ESEC/FSE joint meeting
Towards design for self-healing
Fourth international workshop on Software quality assurance: in conjunction with the 6th ESEC/FSE joint meeting
Tracking bad apples: reporting the origin of null and undefined value errors
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Archipelago: trading address space for reliability and security
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Parallelizing security checks on commodity hardware
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Better bug reporting with better privacy
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Learning from mistakes: a comprehensive study on real world concurrency bug characteristics
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Samurai: protecting critical data in unsafe languages
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Switchblade: enforcing dynamic personalized system call models
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Live monitoring: using adaptive instrumentation and analysis to debug and maintain web applications
HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
From STEM to SEAD: speculative execution for automated defense
ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
Transparent checkpoint-restart of multiple processes on commodity operating systems
ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
Enhancing storage system availability on multi-core architectures with recovery-conscious scheduling
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Companion of the 30th international conference on Software engineering
Flexible Hardware Acceleration for Instruction-Grain Program Monitoring
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
ACM Transactions on Computer Systems (TOCS)
Diverse replication for single-machine Byzantine-fault tolerance
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
LeakSurvivor: towards safely tolerating memory leaks for garbage-collected languages
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Exterminator: Automatically correcting memory errors with high probability
Communications of the ACM - Surviving the data deluge
Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications
Vigilante: End-to-end containment of Internet worm epidemics
ACM Transactions on Computer Systems (TOCS)
Efficiently tracking application interactions using lightweight virtualization
Proceedings of the 1st ACM workshop on Virtual machine security
Using virtual machines to do cross-layer damage assessment
Proceedings of the 1st ACM workshop on Virtual machine security
Online Network Forensics for Automatic Repair Validation
IWSEC '08 Proceedings of the 3rd International Workshop on Security: Advances in Information and Computer Security
Return Value Predictability Profiles for Self---healing
IWSEC '08 Proceedings of the 3rd International Workshop on Security: Advances in Information and Computer Security
ASSURE: automatic software self-healing using rescue points
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
First-aid: surviving and preventing memory management bugs during production runs
Proceedings of the 4th ACM European conference on Computer systems
Transparent checkpoints of closed distributed systems in Emulab
Proceedings of the 4th ACM European conference on Computer systems
A systematic approach to system state restoration during storage controller micro-recovery
FAST '09 Proccedings of the 7th conference on File and storage technologies
FlashBox: a system for logging non-deterministic events in deployed embedded systems
Proceedings of the 2009 ACM symposium on Applied Computing
Self-recovery in server programs
Proceedings of the 2009 international symposium on Memory management
A randomized dynamic program analysis technique for detecting real deadlocks
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
In-field healing of integration problems with COTS components
ICSE '09 Proceedings of the 31st International Conference on Software Engineering
A case for an interleaving constrained shared-memory multi-processor
Proceedings of the 36th annual international symposium on Computer architecture
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Tolerating latency in replicated state machines through client speculation
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Dynamic Software Updates for Accelerating Scientific Discovery
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Automatic Generation of Runtime Failure Detectors from Property Templates
Software Engineering for Self-Adaptive Systems
Self-healing: science, engineering, and fiction
NSPW '07 Proceedings of the 2007 Workshop on New Security Paradigms
Automatically patching errors in deployed software
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Debugging in the (very) large: ten years of implementation and experience
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
International Journal of High Performance Computing Applications
Availability-sensitive intrusion recovery
Proceedings of the 1st ACM workshop on Virtual machine security
Respec: efficient online multiprocessor replayvia speculation and external determinism
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Butterfly analysis: adapting dataflow analysis to dynamic parallel monitoring
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Kivati: fast detection and prevention of atomicity violations
Proceedings of the 5th European conference on Computer systems
A theory of nested speculative execution
COORDINATION'07 Proceedings of the 9th international conference on Coordination models and languages
Learning universal probabilistic models for fault localization
Proceedings of the 9th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering
Towards understanding bugs in open source router software
ACM SIGCOMM Computer Communication Review
Membrane: Operating system support for restartable file systems
ACM Transactions on Storage (TOS)
IBM Journal of Research and Development
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proceedings of the 17th ACM conference on Computer and communications security
Tolerating Concurrency Bugs Using Transactions as Lifeguards
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Recovery tasks: an automated approach to failure recovery
RV'10 Proceedings of the First international conference on Runtime verification
ConSeq: detecting concurrency bugs through sequential errors
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
HotACI'06 Proceedings of the First international conference on Hot topics in autonomic computing
Correlating multi-session attacks via replay
HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
Automatic on-line failure diagnosis at the end-user site
HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
Ensuring content integrity for untrusted peer-to-peer content distribution networks
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
RACEZ: a lightweight and non-invasive race detection tool for production applications
Proceedings of the 33rd International Conference on Software Engineering
Quarantine: fault tolerance for concurrent servers with data-driven selective isolation
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Locating failure-inducing environment changes
Proceedings of the 10th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools
WOOT'11 Proceedings of the 5th USENIX conference on Offensive technologies
Detecting and escaping infinite loops with jolt
Proceedings of the 25th European conference on Object-oriented programming
Floguard: cost-aware systemwide intrusion defense via online forensics and on-demand IDS deployment
SAFECOMP'11 Proceedings of the 30th international conference on Computer safety, reliability, and security
Detecting and surviving data races using complementary schedules
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Making the common case the only case with anticipatory memory allocation
ACM Transactions on Storage (TOS)
Exception handling in the choices operating system
Advanced Topics in Exception Handling Techniques
Applying transactional memory to concurrency bugs
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
What to do when things go wrong: recovery in complex (computer) systems
Proceedings of the 11th annual international conference on Aspect-oriented Software Development Companion
Can deterministic replay be an enabling tool for mobile computing?
Proceedings of the 12th Workshop on Mobile Computing Systems and Applications
Bolt: on-demand infinite loop escape in unmodified binaries
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Katana: Towards Patching as a Runtime Part of the Compiler-Linker-Loader Toolchain
International Journal of Secure Software Engineering
ConAir: featherweight concurrency bug recovery via single-threaded idempotent execution
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Towards hinted collection: annotations for decreasing garbage collector pause times
Proceedings of the 2013 international symposium on memory management
Concurrency bugs in multithreaded software: modeling and analysis using Petri nets
Discrete Event Dynamic Systems
Safe software updates via multi-version execution
Proceedings of the 2013 International Conference on Software Engineering
Exception handlers for healing component-based systems
ACM Transactions on Software Engineering and Methodology (TOSEM) - Testing, debugging, and error handling, formal methods, lifecycle concerns, evolution and maintenance
Leveraging the short-term memory of hardware to diagnose production-run software failures
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: Required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired from allergy treatment in real life, is to rollback the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.We have implemented RX on Linux. Our experiments with four server applications that contain six bugs of various types show that RX can survive all the six software failures and provide transparent fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, RX's checkpointing system is lightweight, imposing small time and space overheads.