The design of a practical system for fault-tolerant virtual machines

Authors:
Daniel J. Scales;Mike Nelson;Ganesh Venkitachalam
Affiliations:
VMware, Inc.;VMware, Inc.;VMware, Inc.
Venue:
ACM SIGOPS Operating Systems Review
Year:
2010

Citing 9
Cited 6

Implementing fault-tolerant services using the state machine approach: a tutorial

ACM Computing Surveys (CSUR)
Hypervisor-based fault tolerance

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Fail-stop processors: an approach to designing fault-tolerant computing systems

ACM Transactions on Computer Systems (TOCS)
TFT: A Software System for Application-Transparent Fault Tolerance

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
A principle for resilient sharing of distributed resources

ICSE '76 Proceedings of the 2nd international conference on Software engineering
ReVirt: enabling intrusion analysis through virtual-machine logging and replay

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Fast transparent migration for virtual machines

ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Rethink the sync

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Remus: high availability via asynchronous virtual machine replication

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation

The evolution of an x86 virtual machine monitor

ACM SIGOPS Operating Systems Review
Challenges in building scalable virtualized datacenter management

ACM SIGOPS Operating Systems Review
Enhancing TCP throughput of highly available virtual machines via speculative communication

VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
Utilizing memory content similarity for improving the performance of highly available virtual machines

Future Generation Computer Systems
kMemvisor: flexible system wide memory mirroring in virtual environments

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
COLO: COarse-grained LOck-stepping virtual machines for non-stop service

Proceedings of the 4th annual Symposium on Cloud Computing

Abstract

We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. In addition, the data bandwidth needed to keep the primary and secondary VMs executing in lockstep is less than 20 Mbit/s for several real applications, which allows for the possibility of implementing fault tolerance over longer distances. An easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide performance results for both micro-benchmarks and real applications.
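
The idea of replicating a primary's execution on a backup can be illustrated with a toy state-machine example. This is only a minimal sketch and not the system described in the paper: `CounterVM`, `LoggingChannel`, `run_primary`, and `replay_on_backup` are hypothetical names, the "VM" is a single integer, and the link between the two servers is an in-memory list. It assumes every state transition is deterministic, so a backup that applies the same inputs in the same order ends in the same state as the primary.

```python
from dataclasses import dataclass, field


@dataclass
class CounterVM:
    """Toy deterministic state machine standing in for a VM (hypothetical)."""
    state: int = 0

    def apply(self, event: int) -> None:
        # Deterministic transition: same state + same event -> same next state.
        self.state = (self.state * 31 + event) % 1_000_003


@dataclass
class LoggingChannel:
    """In-memory stand-in for the link between primary and backup (assumed)."""
    entries: list = field(default_factory=list)

    def send(self, event: int) -> None:
        self.entries.append(event)


def run_primary(vm: CounterVM, inputs: list, channel: LoggingChannel) -> None:
    """Apply each input on the primary and forward it to the backup's log."""
    for event in inputs:
        channel.send(event)  # record the input so the backup can replay it
        vm.apply(event)


def replay_on_backup(vm: CounterVM, channel: LoggingChannel) -> None:
    """Apply the logged inputs on the backup in the original order."""
    for event in channel.entries:
        vm.apply(event)


primary, backup, channel = CounterVM(), CounterVM(), LoggingChannel()
run_primary(primary, [3, 1, 4, 1, 5], channel)
replay_on_backup(backup, channel)
print(primary.state == backup.state)
```

In this sketch only the inputs cross the channel, never the full state, which is one reason a replication link of this kind can stay small relative to the state it protects.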