A fault-tolerant strategy for virtualized HPC clusters

Authors: John Paul Walters; Vipin Chaudhary
Affiliation (both authors): Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, USA
Venue: The Journal of Supercomputing
Year: 2009

Abstract

Virtualization is a common strategy for improving the utilization of existing computing resources, particularly within data centers. However, its use for high performance computing (HPC) applications is currently limited, despite its potential both to improve resource utilization and to provide resource guarantees to users. In this article, we systematically evaluate three major virtual machine implementations for computationally intensive HPC applications using various standard benchmarks. Using VMware Server, Xen, and OpenVZ, we examine the suitability of full virtualization (VMware), paravirtualization (Xen), and operating system-level virtualization (OpenVZ) in terms of network utilization, SMP performance, file system performance, and MPI scalability. We show that the operating system-level virtualization of OpenVZ delivers the best overall performance, particularly for MPI scalability. With the knowledge gained from our VM evaluation, we extend OpenVZ to include support for checkpointing and fault tolerance for MPI-based virtual server distributed computing.