Optimizing VM checkpointing for restore performance in VMware ESXi

  • Authors:
  • Irene Zhang;Tyler Denniston;Yury Baskakov;Alex Garthwaite

  • Affiliations:
  • University of Washington;MIT CSAIL;VMware;CloudPhysics

  • Venue:
  • USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Cloud providers are increasingly looking to use virtual machine checkpointing for new applications beyond fault tolerance. Existing checkpointing systems designed for fault tolerance only optimize for saving checkpointed state, so they cannot support these new applications, which require better restore performance. Improving restore performance requires a predictive technique to reduce the number of disk accesses to bring in the VM's memory on restore. However, complex VM workloads can diverge at any time due to external inputs, background processes, and timing variation, so predicting which pages the VM will access on restore to reduce faults to disk is impossible. Instead, we focus on predicting which pages the VM will access together on restore to improve the efficiency of disk accesses. To reduce the number of faults to disk on restore, we group memory pages likely to be accessed together into locality blocks. On each fault, we can load a block of pages that are likely to be accessed with the faulting page, eliminating future faults and increasing disk efficiency. We implement support for locality blocks, along with several other optimizations, in a new checkpointing system for VMware ESXi Server called Halite. Our experiments show that Halite reduces restore overhead by up to 94% for a range of workloads.