Suspending, migrating and resuming HPC virtual clusters

  • Authors:
  • Paolo Anedda; Simone Leo; Simone Manca; Massimo Gaggero; Gianluigi Zanetti

  • Affiliations:
  • Distributed Computing Group, CRS4, Edificio 1, Polaris, Pula, Italy (all authors)

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2010

Abstract

This paper presents a systematic study of the issues involved in suspending, migrating and resuming virtual clusters for data-driven HPC applications. The focus is on nontrivial virtual clusters, that is, clusters in which the running computation is expected to be coordinated and strongly coupled. It is shown that, in this setting, all cluster-level operations, such as start and save, should be performed as synchronously as possible on all nodes, which introduces the need for barriers at the virtual cluster computing meta-level. Once a synchronization mechanism is provided and appropriate transport strategies have been set up, it is possible to suspend, migrate and resume whole virtual clusters composed of "heavy" virtual machines (4 GB RAM, 6 GB disk images) in times of the order of a few minutes without disrupting the parallel computation running inside them, albeit a computation of the MapReduce type. The approach is intrinsically parallel and should scale without problems to larger virtual clusters.
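
The barrier requirement described in the abstract can be illustrated with a small, purely local sketch. It is not the authors' implementation: threads stand in for per-host agents, Python's `threading.Barrier` stands in for whatever distributed synchronization mechanism the paper uses, and the function name `suspend_cluster`, the node names and the preparation delays are all hypothetical. The timestamp recorded after the barrier marks the point where a real agent would ask its hypervisor to save the virtual machine.

```python
import threading
import time


def suspend_cluster(node_names, prepare_delays, timeout=5.0):
    """Simulate a barrier-synchronized 'save' across all nodes.

    Each thread is a stand-in for a per-host agent (an assumption of this
    sketch). It prepares for a node-specific time, waits at a shared
    barrier, then records when it would issue the save command.
    Returns a dict mapping node name to that timestamp.
    """
    barrier = threading.Barrier(len(node_names))
    issued_at = {}
    lock = threading.Lock()

    def agent(name, delay):
        time.sleep(delay)              # uneven preparation time per node
        barrier.wait(timeout=timeout)  # nobody proceeds until all are ready
        stamp = time.monotonic()       # placeholder for the real save call
        with lock:
            issued_at[name] = stamp

    threads = [
        threading.Thread(target=agent, args=(name, delay))
        for name, delay in zip(node_names, prepare_delays)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return issued_at


if __name__ == "__main__":
    stamps = suspend_cluster(["vm0", "vm1", "vm2", "vm3"], [0.0, 0.05, 0.1, 0.2])
    skew = max(stamps.values()) - min(stamps.values())
    print(f"save commands issued within {skew * 1000:.2f} ms of each other")
```

Without the barrier, the save commands in this sketch would be spread over the full range of preparation delays; with it, every node waits for the slowest one and the commands are issued almost together, which is the property a strongly coupled computation needs.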