Live Migration of Parallel Applications with OpenVZ

  • Authors:
  • Fabian Romero; Thomas J. Hacker

  • Venue:
  • WAINA '11 Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications
  • Year:
  • 2011

Abstract

A parallel application can terminate or produce incorrect results when a computational node fails. As the number of components in large-scale supercomputing systems increases and applications scale to use these resources, the mean time to failure decreases, and application failure becomes more likely. Traditional fault tolerance approaches to this problem, such as checkpointing, are failing to scale as system sizes increase. An alternative approach we explore in this paper is the use of VM-based live migration to move a process from a failing node to a healthy one, reducing the fault rate experienced by an application. We investigate the use of an operating system-level virtualization environment based on OpenVZ to perform live migrations of virtual machines on which multi-processor parallel applications are running. We examine the correctness, performance, and reliability implications of this approach, as well as the additional overhead of using OS-level virtualized systems for fault recovery. Our results confirm that it is possible to efficiently live migrate virtual containers running a parallel application without affecting the correctness or completion of parallel applications running in an OS-level virtualized environment.
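
The scaling argument in the abstract can be illustrated with a minimal sketch. It assumes that node failures are independent and exponentially distributed, in which case the mean time to failure of the whole system is the per-node value divided by the number of nodes. This failure model, the function name `system_mttf`, and the numbers used are illustrative assumptions, not figures or methods from the paper.

```python
def system_mttf(node_mttf_hours: float, num_nodes: int) -> float:
    """Mean time to failure of a system that fails when any one node fails.

    Assumes independent, exponentially distributed node failures
    (an assumption made for illustration, not taken from the paper).
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Failure rates add across independent nodes, so the MTTF divides.
    return node_mttf_hours / num_nodes


# Hypothetical per-node MTTF of 50,000 hours, at increasing system sizes.
for n in (1, 100, 10_000):
    print(n, system_mttf(50_000, n))
```

Under these assumptions, a job spanning thousands of nodes sees failures orders of magnitude more often than a single node does, which is the pressure that motivates migrating work away from a failing node.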