COLO: COarse-grained LOck-stepping virtual machines for non-stop service

Authors:
YaoZu Dong;Wei Ye;YunHong Jiang;Ian Pratt;ShiQing Ma;Jian Li;HaiBing Guan
Affiliations:
Shanghai Jiao Tong University, China and Intel Asia-Pacific R&D Ltd., China;Shanghai Jiao Tong University, China and Intel Asia-Pacific R&D Ltd., China;Shanghai Jiao Tong University, China;Bromium Inc.;Shanghai Jiao Tong University, China and Intel Asia-Pacific R&D Ltd., China;Shanghai Jiao Tong University, China;Shanghai Jiao Tong University, China
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Abstract

Virtual machine (VM) replication provides a software solution for business continuity and disaster recovery through application-agnostic hardware fault tolerance, replicating the state of the primary VM (PVM) to a secondary VM (SVM) on a different physical node. Unfortunately, current VM replication approaches suffer from excessive overhead, which severely limits their applicability and suitability. In this paper, we observe that in a networked client-server system the PVM and SVM can be treated as being in the same state provided they generate the same responses from the clients' point of view, and we exploit this observation to optimize performance. To this end, we propose a generic and highly efficient non-stop service solution named COLO (COarse-grained LOck-stepping virtual machines), which uses on-demand VM replication. COLO monitors the output responses of the PVM and SVM and treats the SVM as a valid replica of the PVM according to whether its outputs match those of the PVM. If the responses do not match, the network response is withheld until the PVM's state has been synchronized to the SVM. This ensures that the system can always fail over to the SVM. Although non-determinism may leave the SVM's internal state different from that of the PVM, that state is equally valid and remains consistent from the point of view of external observers. Unlike earlier approaches based on instruction-level lock-stepping and deterministic execution, COLO can readily support multiprocessor (MP) workloads with satisfactory performance. Results show that COLO significantly outperforms existing approaches, particularly on server-client workloads such as online databases and web server applications.
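
The output-comparison rule described in the abstract can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the authors' implementation: the class `ColoSketch`, its fields, and the methods `synchronize` and `handle_outputs` are hypothetical names, VM state is reduced to a single integer, and a synchronization is modeled as a plain copy of the PVM state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColoSketch:
    """Toy model of output-comparison replication (all names are hypothetical)."""
    pvm_state: int = 0   # stand-in for the primary VM's internal state
    svm_state: int = 0   # stand-in for the secondary VM's internal state
    released: List[bytes] = field(default_factory=list)  # responses sent to clients
    sync_count: int = 0  # number of forced PVM -> SVM synchronizations

    def synchronize(self) -> None:
        # Assumption: synchronization copies the whole PVM state to the SVM.
        self.svm_state = self.pvm_state
        self.sync_count += 1

    def handle_outputs(self, pvm_out: bytes, svm_out: bytes) -> None:
        if pvm_out != svm_out:
            # Mismatch: withhold the response until the SVM is synchronized.
            self.synchronize()
        # Outputs matched, or the SVM now mirrors the PVM: release the response.
        self.released.append(pvm_out)


colo = ColoSketch()

# Internal states differ (non-determinism), but the responses are identical,
# so the SVM still counts as a valid replica and no synchronization happens.
colo.pvm_state, colo.svm_state = 5, 7
colo.handle_outputs(b"200 OK", b"200 OK")

# The responses now diverge, so the PVM state is synchronized before release.
colo.pvm_state = 9
colo.handle_outputs(b"row=9", b"row=7")

print(colo.sync_count, colo.svm_state, colo.released)
```

In this sketch the first request is released without any state transfer even though the two internal states differ, which is the source of the performance benefit the abstract claims; only the second, divergent response forces a synchronization.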