Optimizing VM checkpointing for restore performance in VMware ESXi

Authors:
Irene Zhang;Tyler Denniston;Yury Baskakov;Alex Garthwaite
Affiliations:
University of Washington;MIT CSAIL;VMware;CloudPhysics
Venue:
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Year:
2013

Abstract

Cloud providers are increasingly looking to use virtual machine checkpointing for new applications beyond fault tolerance. Existing checkpointing systems, designed for fault tolerance, optimize only for saving checkpointed state, so they cannot support these new applications, which require better restore performance. Improving restore performance requires a predictive technique that reduces the number of disk accesses needed to bring in the VM's memory on restore. However, complex VM workloads can diverge at any time due to external inputs, background processes, and timing variation, so it is impossible to predict exactly which pages the VM will access on restore. Instead, we focus on predicting which pages the VM will access together on restore, which improves the efficiency of disk accesses. To reduce the number of faults to disk on restore, we group memory pages that are likely to be accessed together into locality blocks. On each fault, we load the block of pages likely to be accessed with the faulting page, eliminating future faults and increasing disk efficiency. We implement support for locality blocks, along with several other optimizations, in Halite, a new checkpointing system for VMware ESXi Server. Our experiments show that Halite reduces restore overhead by up to 94% for a range of workloads.
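
The locality-block idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Halite's implementation: the abstract does not say how pages are grouped, so the code below assumes pages are grouped by the order of their first access in a trace recorded before the checkpoint. The names build_locality_blocks, count_faults, and BLOCK_SIZE are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Set

BLOCK_SIZE = 4  # hypothetical number of pages per locality block


def build_locality_blocks(access_trace: List[int],
                          block_size: int = BLOCK_SIZE) -> Dict[int, int]:
    """Map each page to a block id, grouping pages by order of first access.

    Assumption: pages first touched close together in time before the
    checkpoint are likely to be touched together again after restore.
    """
    page_to_block: Dict[int, int] = {}
    for page in access_trace:
        if page not in page_to_block:
            page_to_block[page] = len(page_to_block) // block_size
    return page_to_block


def count_faults(restore_trace: List[int],
                 page_to_block: Optional[Dict[int, int]] = None) -> int:
    """Count faults to disk while replaying restore_trace.

    Without a block map, each fault loads only the faulting page.
    With a block map, each fault loads the faulting page's whole block.
    """
    # Invert the page -> block map so a fault can fetch every page in a block.
    block_to_pages: Dict[int, List[int]] = {}
    if page_to_block is not None:
        for page, block in page_to_block.items():
            block_to_pages.setdefault(block, []).append(page)

    resident: Set[int] = set()
    faults = 0
    for page in restore_trace:
        if page in resident:
            continue  # already in memory, no disk access needed
        faults += 1
        if page_to_block is not None and page in page_to_block:
            resident.update(block_to_pages[page_to_block[page]])
        else:
            resident.add(page)
    return faults


if __name__ == "__main__":
    # Pages touched before the checkpoint, in first-access order.
    before = [10, 57, 3, 99, 20, 21, 7, 8]
    # Pages touched after restore: same groups, different order.
    after = [57, 10, 99, 3, 8, 20]

    blocks = build_locality_blocks(before)
    print("faults, page at a time:", count_faults(after))
    print("faults, locality blocks:", count_faults(after, blocks))
```

In this toy trace the restored VM touches the same groups of pages in a different order. The grouping predicts only which pages go together, not when they are accessed, so the block-based replay still avoids most faults.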