Proactive fault tolerance for HPC with Xen virtualization

Authors:
Arun Babu Nagarajan; Frank Mueller; Christian Engelmann; Stephen L. Scott
Affiliations:
North Carolina State University, Raleigh, NC; North Carolina State University, Raleigh, NC; Oak Ridge National Laboratory, Oak Ridge, TN; Oak Ridge National Laboratory, Oak Ridge, TN
Venue:
Proceedings of the 21st annual international conference on Supercomputing
Year:
2007

Abstract

Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes that recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we promote a proactive one in which processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination, and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications, one that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
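
The daemon's three duties named in the abstract (health monitoring, load determination, initiation of guest OS migration) can be illustrated with a minimal sketch. This is not the authors' implementation. The `NodeStatus` record, the temperature reading used as the health signal, the 70 °C limit, and the policy of migrating only to healthy nodes that host no guest are all assumptions made for illustration. The sketch only builds the command lines it would run. The use of `xm migrate --live`, the live-migration command of the Xen 3.x `xm` toolstack, is likewise an assumption about the deployment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one physical node, as a daemon might collect it."""
    host: str                    # physical node name
    temperature_c: float         # example health reading; real sensors vary
    load: float                  # e.g. a load average reported by the node
    guest: Optional[str] = None  # guest domain hosted on this node, if any


def is_unhealthy(node: NodeStatus, temp_limit_c: float = 70.0) -> bool:
    # Assumed health rule: a node counts as deteriorating once it reaches the limit.
    return node.temperature_c >= temp_limit_c


def pick_target(nodes: List[NodeStatus], temp_limit_c: float = 70.0) -> Optional[NodeStatus]:
    # Load determination: choose the least-loaded healthy node that hosts no guest.
    candidates = [n for n in nodes
                  if n.guest is None and not is_unhealthy(n, temp_limit_c)]
    return min(candidates, key=lambda n: n.load) if candidates else None


def plan_migrations(nodes: List[NodeStatus], temp_limit_c: float = 70.0) -> List[List[str]]:
    """Return one argument list per planned live migration (nothing is executed)."""
    plans: List[List[str]] = []
    pool = list(nodes)
    for node in nodes:
        if node.guest is None or not is_unhealthy(node, temp_limit_c):
            continue
        target = pick_target(pool, temp_limit_c)
        if target is None:
            break  # no spare healthy node left; reactive FT has to cover this case
        # Assumed toolstack call: xm migrate --live <domain> <destination host>
        plans.append(["xm", "migrate", "--live", node.guest, target.host])
        # A chosen target is no longer spare, so drop it from the pool.
        pool = [n for n in pool if n.host != target.host]
    return plans


if __name__ == "__main__":
    cluster = [
        NodeStatus("n1", 82.0, 1.0, guest="vm1"),  # overheating, hosts an MPI task
        NodeStatus("n2", 45.0, 0.9, guest="vm2"),  # healthy, hosts an MPI task
        NodeStatus("s1", 40.0, 0.3),               # healthy spare
        NodeStatus("s2", 41.0, 0.1),               # healthy spare, least loaded
    ]
    for cmd in plan_migrations(cluster):
        print(" ".join(cmd))
```

Returning the commands instead of running them keeps the decision logic testable without a Xen host. A real daemon would poll sensors periodically and hand each argument list to a process launcher.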