Scalable transparent checkpoint-restart of global address space applications on virtual machines over infiniband

Authors:
Oreste Villa;Sriram Krishnamoorthy;Jarek Nieplocha;David M. Brown, Jr.
Affiliations:
PNNL, Richland, USA;PNNL, Richland, USA;PNNL, Richland, USA;PNNL, Richland, USA
Venue:
Proceedings of the 6th ACM conference on Computing frontiers
Year:
2009

Abstract

Checkpoint-restart is one of the most widely used software approaches to achieving fault tolerance in high-end clusters. While standard techniques typically focus on user-level solutions, the advent of virtualization software has enabled efficient and transparent system-level approaches. In this paper, we present a scalable, transparent system-level solution that provides fault tolerance for applications based on global address space (GAS) programming models on InfiniBand clusters. In addition to handling communication, the solution transparently checkpoints user-generated files. We exploit the support for the InfiniBand network in the Xen virtual machine environment. We have developed a version of the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication library capable of suspending and resuming applications. We present efficient and scalable mechanisms to distribute checkpoint requests and to back up virtual machine memory images and file systems. We tested our approach in the context of NWChem, a popular computational chemistry suite. We demonstrated that NWChem can be executed, without any modification to the source code, on a virtualized 8-node cluster with very little overhead (below 3%). We observed that the total checkpoint time is limited by disk I/O. Finally, we measured the system-size-dependent components of the checkpoint time on up to 1024 cores (128 nodes), demonstrating the scalability of our approach on medium- and large-scale systems.