HotSnap: a hot distributed snapshot system for virtual machine cluster

Authors:
Lei Cui;Bo Li;Yangyang Zhang;Jianxin Li
Affiliations:
Beihang University, Beijing, China;Beihang University, Beijing, China;Beihang University, Beijing, China;Beihang University, Beijing, China
Venue:
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Year:
2013




Abstract

Managing a virtual machine cluster (VMC) is challenging owing to reliability requirements such as non-stop service and failure tolerance. Distributed snapshot of a VMC is one promising approach to supporting system reliability: it allows data center administrators to recover the system from failure and resume execution from an intermediate state rather than the initial state. However, due to the heavyweight nature of virtual machine (VM) technology, applications running in the VMC suffer from long downtime and performance degradation during a snapshot. In addition, the discrepancy in snapshot completion times among VMs causes the TCP backoff problem, resulting in network interruption between two communicating VMs. This paper proposes HotSnap, a VMC snapshot approach designed to take hot distributed snapshots with only milliseconds of system downtime and TCP backoff duration. At the core of HotSnap are the transient snapshot, which saves the minimum instantaneous state in a short time, and the full snapshot, which saves the entire VM state during normal operation. We then design a snapshot protocol to coordinate the individual VM snapshots into a globally consistent state of the VMC. We have implemented HotSnap on QEMU/KVM and conducted several experiments to show its effectiveness and efficiency. Compared to the live-migration-based distributed snapshot technique, which incurs seconds of system downtime and network interruption, HotSnap incurs only tens of milliseconds.
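
The split between a short transient snapshot and a background full snapshot can be illustrated with a toy model. The sketch below is not HotSnap's implementation: it assumes a copy-on-write scheme, which the abstract does not specify, and the class `ToyVM` and all of its methods are hypothetical names introduced only for illustration. Memory is modeled as a list of integer "pages".

```python
class ToyVM:
    """Toy stand-in for a VM (hypothetical; not the HotSnap code)."""

    def __init__(self, pages):
        self.memory = list(pages)
        self.snapshot = {}    # page index -> content at snapshot time
        self.pending = set()  # pages the full snapshot has not saved yet

    def transient_snapshot(self):
        """Short pause: only record which pages still need saving."""
        self.snapshot = {}
        self.pending = set(range(len(self.memory)))

    def write(self, index, value):
        """Guest write during normal operation (copy-on-write assumption)."""
        if index in self.pending:
            # Save the old content before the guest overwrites it.
            self.snapshot[index] = self.memory[index]
            self.pending.discard(index)
        self.memory[index] = value

    def full_snapshot_step(self, batch=1):
        """Background saver: copy up to `batch` pending pages."""
        for index in sorted(self.pending)[:batch]:
            self.snapshot[index] = self.memory[index]
            self.pending.discard(index)
        return not self.pending  # True once every page has been saved

    def snapshot_image(self):
        """Return the saved pages in address order."""
        return [self.snapshot[i] for i in range(len(self.memory))]


vm = ToyVM([10, 20, 30, 40])
vm.transient_snapshot()          # the only step that would pause the guest
vm.write(2, 99)                  # guest keeps running and dirties a page
while not vm.full_snapshot_step(batch=2):
    vm.write(0, 7)               # more guest writes while saving continues

print(vm.snapshot_image())       # [10, 20, 30, 40]
print(vm.memory)                 # [7, 20, 99, 40]
```

In this model the saved image reflects memory exactly as it was at the transient snapshot, even though the guest kept writing while pages were being saved. Coordinating such per-VM snapshots into a globally consistent state is a separate concern handled by the snapshot protocol and is not modeled here.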