Lightweight live migration for high availability cluster service

  • Authors:
  • Bo Jiang;Binoy Ravindran;Changsoo Kim

  • Affiliations:
  • ECE Dept., Virginia Tech;ECE Dept., Virginia Tech;ETRI, Daejeon, South Korea

  • Venue:
  • SSS'10 Proceedings of the 12th international conference on Stabilization, safety, and security of distributed systems
  • Year:
  • 2010

Abstract

High availability is a critical feature for service clusters and cloud computing, and is often considered more valuable than performance. One commonly used technique for enhancing availability is live migration, which replicates services using virtualization technology. However, continuous live migration with checkpointing introduces significant overhead. In this paper, we present a lightweight live migration (LLM) mechanism that integrates whole-system migration with input replay, aiming to reduce this overhead while providing comparable availability. Between checkpoints of system updates, LLM migrates service requests from network clients at high frequency. If the primary machine fails, the backup machine continues the service based on the virtual machine image and the network inputs from their respective last migration rounds. We implemented LLM on Xen and compared it with Remus, a state-of-the-art effort that enhances availability by checkpointing system status updates. Our experimental evaluations show that LLM clearly outperforms Remus in terms of network delay and overhead. For certain types of applications, LLM may also be a better alternative to Remus in terms of downtime. In addition, LLM achieves transaction-level consistency, as Remus does.
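The recovery idea described in the abstract (infrequent whole-system checkpoints, frequent forwarding of client inputs, and replay on the backup after a failure) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not the authors' Xen-based implementation: the service state is a plain dictionary, requests are integers, and the names `Backup`, `apply_request`, and `run_until_failure` are hypothetical and do not come from the paper.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


def apply_request(state: Dict[str, int], req: int) -> None:
    """Toy service logic (an assumption): keep a running total of requests."""
    state["total"] = state.get("total", 0) + req


@dataclass
class Backup:
    # Last checkpointed service state (stands in for the VM image).
    image: Dict[str, int] = field(default_factory=dict)
    # Inputs forwarded by the primary since that checkpoint.
    pending_inputs: List[int] = field(default_factory=list)

    def receive_checkpoint(self, state: Dict[str, int]) -> None:
        # A new image already reflects all earlier inputs, so drop them.
        self.image = copy.deepcopy(state)
        self.pending_inputs.clear()

    def receive_inputs(self, inputs: List[int]) -> None:
        self.pending_inputs.extend(inputs)

    def take_over(self) -> Dict[str, int]:
        # Resume from the last image, then replay the forwarded inputs.
        state = copy.deepcopy(self.image)
        for req in self.pending_inputs:
            apply_request(state, req)
        return state


def run_until_failure(requests, checkpoint_every, forward_every):
    """Process all requests on the primary, then treat it as failed."""
    primary_state: Dict[str, int] = {}
    backup = Backup()
    unsent: List[int] = []  # inputs not yet forwarded to the backup
    for i, req in enumerate(requests, start=1):
        apply_request(primary_state, req)
        unsent.append(req)
        if i % checkpoint_every == 0:
            # Infrequent, expensive step: ship the whole state.
            backup.receive_checkpoint(primary_state)
            unsent.clear()
        elif i % forward_every == 0:
            # Frequent, cheap step: ship only the recent inputs.
            backup.receive_inputs(unsent)
            unsent.clear()
    return primary_state, backup


primary, backup = run_until_failure(range(1, 11), checkpoint_every=4, forward_every=1)
recovered = backup.take_over()
```

In this run the last checkpoint is taken after request 8 and requests 9 and 10 are forwarded individually, so the backup's replayed state matches the primary's. With a larger `forward_every`, inputs that the primary processed but had not yet forwarded at the time of failure are lost in this toy model, which illustrates why the abstract stresses migrating inputs at high frequency.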