Lightweight live migration for high availability cluster service

Authors:
Bo Jiang;Binoy Ravindran;Changsoo Kim
Affiliations:
ECE Dept., Virginia Tech;ECE Dept., Virginia Tech;ETRI, Daejeon, South Korea
Venue:
SSS'10 Proceedings of the 12th international conference on Stabilization, safety, and security of distributed systems
Year:
2010

Abstract

High availability is a critical feature of service clusters and cloud computing, and is often considered more valuable than performance. One commonly used technique for enhancing availability is live migration, which replicates services using virtualization technology. However, continuous live migration with checkpointing introduces significant overhead. In this paper, we present a lightweight live migration (LLM) mechanism that integrates whole-system migration with input replay, aiming to reduce overhead while providing comparable availability. LLM migrates service requests from network clients at high frequency during the intervals between checkpoints of system updates. If the primary machine fails, the backup machine continues the service from the virtual machine image and the network inputs of their respective last migration rounds. We implemented LLM on Xen and compared it with Remus, a state-of-the-art approach that enhances availability by checkpointing system status updates. Our experimental evaluation shows that LLM clearly outperforms Remus in terms of network delay and overhead. For certain types of applications, LLM may also be a better alternative than Remus in terms of downtime. In addition, LLM achieves transaction-level consistency, as Remus does.
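
The recovery idea described in the abstract (infrequent checkpoints, frequent input migration, replay on failover) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' Xen-based implementation: the names `Primary`, `Backup`, `apply_request`, and `demo` are hypothetical, the "virtual machine image" is reduced to a Python dictionary, and request handling is assumed to be deterministic so that replaying logged inputs reproduces the primary's state.

```python
import copy
from dataclasses import dataclass, field


def apply_request(state, request):
    """Deterministic toy service: each request adds an amount to a key."""
    key, amount = request
    state[key] = state.get(key, 0) + amount


@dataclass
class Backup:
    snapshot: dict = field(default_factory=dict)    # last checkpointed state
    snapshot_seq: int = 0                           # last request covered by it
    input_log: list = field(default_factory=list)   # (seq, request) after snapshot

    def receive_checkpoint(self, state, seq):
        # Infrequent and expensive: store a full copy of the state.
        self.snapshot = copy.deepcopy(state)
        self.snapshot_seq = seq
        # Inputs already reflected in the snapshot need not be replayed.
        self.input_log = [(s, r) for s, r in self.input_log if s > seq]

    def receive_inputs(self, batch):
        # Frequent and cheap: only the client requests are transferred.
        self.input_log.extend(batch)

    def failover(self):
        # Restore the last snapshot, then replay the inputs logged after it.
        state = copy.deepcopy(self.snapshot)
        for _, request in sorted(self.input_log):
            apply_request(state, request)
        return state


@dataclass
class Primary:
    backup: Backup
    state: dict = field(default_factory=dict)
    seq: int = 0
    pending: list = field(default_factory=list)     # inputs not yet migrated

    def handle(self, request):
        self.seq += 1
        apply_request(self.state, request)
        self.pending.append((self.seq, request))

    def migrate_inputs(self):
        self.backup.receive_inputs(self.pending)
        self.pending = []

    def checkpoint(self):
        self.migrate_inputs()
        self.backup.receive_checkpoint(self.state, self.seq)


def demo():
    backup = Backup()
    primary = Primary(backup)
    primary.handle(("a", 1))
    primary.checkpoint()          # low-frequency checkpoint round
    primary.handle(("a", 2))
    primary.handle(("b", 5))
    primary.migrate_inputs()      # high-frequency input migration round
    primary.handle(("b", 7))      # never migrated: not recoverable in this toy model
    return primary.state, backup.failover()
```

Calling `demo()` returns the primary's state just before a simulated failure together with the state the backup reconstructs. In this toy model the backup recovers everything up to the last input migration round, which is why migrating inputs more often than checkpoints narrows the window of lost work without paying the cost of a full checkpoint each time.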