Virtual machine (VM) replication provides a software solution for business continuity and disaster recovery through application-agnostic hardware fault tolerance, replicating the state of the primary VM (PVM) to a secondary VM (SVM) on a different physical node. Unfortunately, current VM replication approaches suffer from excessive overhead, which severely limits their applicability. In this paper, we exploit a practical property of networked server-client systems: the PVM and SVM can be considered to be in the same state as long as they generate the same responses from the clients' point of view. To this end, we propose a generic and highly efficient non-stop service solution, named "COLO" (COarse-grained LOck-stepping virtual machine), which uses on-demand VM replication. COLO monitors the output responses of the PVM and SVM and treats the SVM as a valid replica of the PVM based on the similarity of their outputs. If the responses do not match, the network response is withheld until the PVM's state has been synchronized to the SVM. Hence, the system is always able to fail over to the SVM. Although non-determinism may cause the internal state of the SVM to differ from that of the PVM, the SVM's state is equally valid and remains consistent from the point of view of external observers. Unlike earlier instruction-level lock-stepping deterministic execution approaches, COLO can easily support multiprocessor (MP) workloads with satisfactory performance. Results show that COLO significantly outperforms existing approaches, particularly on server-client workloads such as online databases and web server applications.
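
The output-comparison rule described in the abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not the COLO implementation: the class `ColoOutputGate`, the `synchronize` callback, and the use of exact byte equality as the similarity test are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ColoOutputGate:
    """Toy model of COLO-style output comparison (all names are hypothetical)."""

    # Hypothetical callback standing in for copying the PVM's state to the SVM.
    synchronize: Callable[[], None]
    released: List[bytes] = field(default_factory=list)
    sync_count: int = 0

    def on_responses(self, pvm_response: bytes, svm_response: bytes) -> bytes:
        """Release the PVM response, synchronizing the SVM first on a mismatch."""
        # Assumption: "similar" is modeled here as exact byte equality.
        if pvm_response != svm_response:
            # Outputs diverged: withhold the response until the SVM has been
            # synchronized with the PVM.
            self.synchronize()
            self.sync_count += 1
        # Either the outputs matched or the SVM was just synchronized, so the
        # SVM could take over once this response reaches the client.
        self.released.append(pvm_response)
        return pvm_response


gate = ColoOutputGate(synchronize=lambda: None)
gate.on_responses(b"HTTP/1.1 200 OK", b"HTTP/1.1 200 OK")  # match: no sync needed
gate.on_responses(b"id=41", b"id=42")                      # mismatch: sync first
print(gate.sync_count, len(gate.released))
```

In this sketch, synchronization is triggered only when the two outputs differ, which is the sense in which the replication is on-demand; matching responses are released without any state transfer.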