The Systematic Improvement of Fault Tolerance in the Rio File Cache
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Xen and the art of virtualization
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Improving the reliability of commodity operating systems
ACM Transactions on Computer Systems (TOCS)
Diagnosing performance overheads in the xen virtual machine environment
Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
A Fast Rejuvenation Technique for Server Consolidation with Virtual Machines
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Maintaining Network QoS Across NIC Device Driver Failures Using Virtualization
NCA '09 Proceedings of the 2009 Eighth IEEE International Symposium on Network Computing and Applications
Otherworld: giving applications a chance to survive OS kernel crashes
Proceedings of the 5th European conference on Computer systems
Transparent Fault Tolerance of Device Drivers for Virtual Machines
IEEE Transactions on Computers
Fast Software Rejuvenation of Virtual Machine Monitors
IEEE Transactions on Dependable and Secure Computing
Breaking up is hard to do: security and functionality in a commodity hypervisor
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Trends and challenges in operating systems---from parallel computing to cloud computing
Concurrency and Computation: Practice & Experience
Hi-index | 0.00 |
With existing virtualized systems, hypervisor failures lead to overall system failure and the loss of all the work in progress of virtual machines (VMs) running on the system. We introduce ReHype, a mechanism for recovery from hypervisor failures by booting a new instance of the hypervisor while preserving the state of running VMs. VMs are stalled during the hypervisor reboot and resume normal execution once the new hypervisor instance is running. Hypervisor failures can lead to arbitrary state corruption and inconsistencies throughout the system. ReHype deals with the challenge of protecting the recovered hypervisor instance from such corrupted state and resolving inconsistencies between different parts of hypervisor state as well as between the hypervisor and VMs and between the hypervisor and the hardware. We have implemented ReHype for the Xen hypervisor. The implementation was done incrementally, using results from fault injection experiments to identify the sources of dangerous state corruption and inconsistencies. The implementation of ReHype involved only 880 LOC added or modified in Xen. The memory space overhead of ReHype is only 2.1MB for a pristine copy of the hypervisor code and static data plus a small reserved memory area. The fault injection campaigns used to evaluate the effectiveness of ReHype involved a system with multiple VMs running I/O and hypercall-intensive benchmarks. Our experimental results show that the ReHype prototype can successfully recover from over 90% of detected hypervisor failures.