The design of a practical system for fault-tolerant virtual machines

  • Authors:
  • Daniel J. Scales;Mike Nelson;Ganesh Venkitachalam

  • Affiliations:
  • VMware, Inc.;VMware, Inc.;VMware, Inc.

  • Venue:
  • ACM SIGOPS Operating Systems Review
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. In addition, the data bandwidth needed to keep the primary and secondary VM executing in lockstep is less than 20 Mbit/s for several real applications, which allows for the possibility of implementing fault tolerance over longer distances. An easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide performance results for both micro-benchmarks and real applications