Remus: high availability via asynchronous virtual machine replication

  • Authors:
  • Brendan Cully; Geoffrey Lefebvre; Dutch Meyer; Mike Feeley; Norm Hutchinson; Andrew Warfield

  • Affiliations:
  • Department of Computer Science, The University of British Columbia (all authors)

  • Venue:
  • NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
  • Year:
  • 2008

Abstract

Allowing applications to survive hardware failure is an expensive undertaking, which generally involves reengineering software to include complicated recovery logic as well as deploying special-purpose hardware; this represents a severe barrier to improving the dependability of large or legacy applications. We describe the construction of a general and transparent high availability service that allows existing, unmodified software to be protected from the failure of the physical machine on which it runs. Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections. Our approach encapsulates protected software in a virtual machine, asynchronously propagates changed state to a backup host at frequencies as high as forty times a second, and uses speculative execution to concurrently run the active VM slightly ahead of the replicated system state.
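
The replication scheme in the last sentence of the abstract can be illustrated with a toy model. The sketch below is not the Remus implementation: the names `Primary`, `Backup`, and `run_epochs` are hypothetical, a dictionary stands in for VM memory pages, and asynchronous transfer is approximated by delivering each checkpoint to the backup only after the primary has already executed the next epoch. It assumes the 25 ms epoch implied by "forty times a second" purely as a labeled constant; no real timing is performed.

```python
from dataclasses import dataclass, field

# Assumption: 40 checkpoints per second, as quoted in the abstract.
EPOCH_SECONDS = 1 / 40


@dataclass
class Backup:
    """Holds the most recent fully received checkpoint (toy model)."""
    pages: dict = field(default_factory=dict)

    def apply(self, dirty: dict) -> None:
        # Merge the changed pages into the replicated state.
        self.pages.update(dirty)


@dataclass
class Primary:
    """Active VM with naive dirty-page tracking (toy model)."""
    pages: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)

    def write(self, page: int, value: str) -> None:
        self.pages[page] = value
        self.dirty.add(page)

    def checkpoint(self) -> dict:
        # Copy out only the pages changed since the last checkpoint.
        snapshot = {p: self.pages[p] for p in self.dirty}
        self.dirty.clear()
        return snapshot


def run_epochs(workload, backup):
    """Run one list of (page, value) writes per epoch.

    The checkpoint taken at the end of epoch N reaches the backup only
    after the primary has executed epoch N+1, so the primary runs
    speculatively ahead of the replicated state.
    """
    primary = Primary()
    in_flight = None
    for writes in workload:
        for page, value in writes:
            primary.write(page, value)
        if in_flight is not None:
            backup.apply(in_flight)  # previous checkpoint arrives now
        in_flight = primary.checkpoint()
    return primary, in_flight


backup = Backup()
workload = [[(0, "a")], [(1, "b")], [(0, "c")]]
primary, in_flight = run_epochs(workload, backup)
print("primary:", primary.pages)
print("backup: ", backup.pages)
print("in flight:", in_flight)
```

If the primary failed at the end of this run, the backup would resume from its last complete checkpoint, which lags the primary by the one epoch still in flight.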