Optimal checkpointing and local recording for domino-free rollback recovery
Information Processing Letters
Efficient distributed recovery using message logging
Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
Debugging with dynamic slicing and backtracking
Software—Practice & Experience
Hypervisor-based fault tolerance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Software Rejuvenation: Analysis, Module and Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
The design and implementation of Zap: a system for migrating computing environments
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
An API for Runtime Code Patching
International Journal of High Performance Computing Applications
ACM Transactions on Programming Languages and Systems (TOPLAS)
Automatically Finding and Patching Bad Error Handling
EDCC '06 Proceedings of the Sixth European Dependable Computing Conference
Debugging operating systems with time-traveling virtual machines
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Building a reactive immune system for software services
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
How Long Will It Take to Fix This Bug?
MSR '07 Proceedings of the Fourth International Workshop on Mining Software Repositories
Dynamic and adaptive updates of non-quiescent subsystems in commodity operating system kernels
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Preventing Memory Error Exploits with WIT
SP '08 Proceedings of the 2008 IEEE Symposium on Security and Privacy
ASSURE: automatic software self-healing using rescue points
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Automatically patching errors in deployed software
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
A few billion lines of code later: using static analysis to find bugs in the real world
Communications of the ACM
Information Systems Research
KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Fast and practical instruction-set randomization for commodity systems
Proceedings of the 26th Annual Computer Security Applications Conference
REASSURE: a self-contained mechanism for healing software using rescue points
IWSEC'11 Proceedings of the 6th International conference on Advances in information and computer security
libdft: practical dynamic data flow tracking for commodity systems
VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
Smashing the Gadgets: Hindering Return-Oriented Programming Using In-place Code Randomization
SP '12 Proceedings of the 2012 IEEE Symposium on Security and Privacy
Chronicler: lightweight recording to reproduce field failures
Proceedings of the 2013 International Conference on Software Engineering
Hi-index | 0.00 |
Software bugs and vulnerabilities cause serious problems to both home users and the Internet infrastructure, limiting the availability of Internet services, causing loss of data, and reducing system integrity. Software self-healing using rescue points (RPs) is a known mechanism for recovering from unforeseen errors. However, applying it on multitier architectures can be problematic because certain actions, like transmitting data over the network, cannot be undone. We propose cascading rescue points (CRPs) to address the state inconsistency issues that can arise when using traditional RPs to recover from errors in interconnected applications. With CRPs, when an application executing within a RP transmits data, the remote peer is notified to also perform a checkpoint, so the communicating entities checkpoint in a coordinated, but loosely coupled way. Notifications are also sent when RPs successfully complete execution, and when recovery is initiated, so that the appropriate action is performed by remote parties. We developed a tool that implements CRPs by dynamically instrumenting binaries and transparently injecting notifications in the already established TCP channels between applications. We tested our tool with various applications, including the MySQL and Apache servers, and show that it allows them to successfully recover from errors, while incurring moderate overhead between 4.54% and 71.56%.