ReHype: enabling VM survival across hypervisor failures

Authors:
Michael Le; Yuval Tamir
Affiliations:
UCLA Computer Science Department, Los Angeles, CA, USA; UCLA Computer Science Department, Los Angeles, CA, USA
Venue:
Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Year:
2011

Abstract

With existing virtualized systems, hypervisor failures lead to overall system failure and the loss of all the work in progress of virtual machines (VMs) running on the system. We introduce ReHype, a mechanism for recovery from hypervisor failures by booting a new instance of the hypervisor while preserving the state of running VMs. VMs are stalled during the hypervisor reboot and resume normal execution once the new hypervisor instance is running. Hypervisor failures can lead to arbitrary state corruption and inconsistencies throughout the system. ReHype deals with the challenge of protecting the recovered hypervisor instance from such corrupted state and resolving inconsistencies between different parts of hypervisor state as well as between the hypervisor and VMs and between the hypervisor and the hardware. We have implemented ReHype for the Xen hypervisor. The implementation was done incrementally, using results from fault injection experiments to identify the sources of dangerous state corruption and inconsistencies. The implementation of ReHype involved only 880 LOC added or modified in Xen. The memory space overhead of ReHype is only 2.1 MB for a pristine copy of the hypervisor code and static data plus a small reserved memory area. The fault injection campaigns used to evaluate the effectiveness of ReHype involved a system with multiple VMs running I/O- and hypercall-intensive benchmarks. Our experimental results show that the ReHype prototype can successfully recover from over 90% of detected hypervisor failures.
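
The recovery flow described in the abstract (stall the VMs, boot a new hypervisor instance, carry the VM state across, resume) can be illustrated with a toy Python model. This is only a sketch for intuition and not the authors' Xen implementation: the `VM`, `Hypervisor`, and `recover` names are hypothetical, and using a set of stale locks to stand in for "inconsistent hypervisor state" is an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class VMState(Enum):
    RUNNING = "running"
    STALLED = "stalled"


@dataclass
class VM:
    # Hypothetical stand-in for a guest; `memory` represents preserved VM state.
    name: str
    memory: dict
    state: VMState = VMState.RUNNING


@dataclass
class Hypervisor:
    # Hypothetical stand-in for one hypervisor instance.
    generation: int
    vms: list = field(default_factory=list)
    # Assumption: locks held at failure time model inconsistent internal state.
    held_locks: set = field(default_factory=set)
    failed: bool = False


def recover(failed_hv: Hypervisor) -> Hypervisor:
    """Toy recovery: boot a new instance and carry the running VMs across."""
    # 1. Stall every VM for the duration of the hypervisor reboot.
    for vm in failed_hv.vms:
        vm.state = VMState.STALLED

    # 2. Boot a fresh instance. Its internal state starts clean, so the
    #    possibly corrupted state of the failed instance is not inherited.
    new_hv = Hypervisor(generation=failed_hv.generation + 1)

    # 3. Re-attach the preserved VMs. Stale locks are deliberately not copied.
    new_hv.vms = failed_hv.vms

    # 4. Resume normal execution once the new instance is running.
    for vm in new_hv.vms:
        vm.state = VMState.RUNNING
    return new_hv
```

In this model a call such as `recover(Hypervisor(generation=1, vms=[VM("vm0", {"counter": 41})], held_locks={"heap_lock"}, failed=True))` returns a second-generation instance whose VM still holds `{"counter": 41}`. The real system must additionally reconcile state between the hypervisor, the VMs, and the hardware, which this sketch does not attempt.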