Proactive process-level live migration in HPC environments

Authors:
Chao Wang;Frank Mueller;Christian Engelmann;Stephen L. Scott
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN
Venue:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Year:
2008



Abstract

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently withstand node failures caused by deteriorating health, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1 to 6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13 to 24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
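
The closing claim can be sanity-checked with a short calculation. The sketch below is an illustration, not the authors' evaluation method. It assumes that checkpoint intervals follow Young's first-order approximation, T = sqrt(2 * C * M), where C is the cost of one checkpoint and M is the mean time between failures, and that faults handled proactively no longer count toward M for the reactive scheme. The function names and the sample values are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum checkpoint interval, T = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count_ratio(proactive_fraction: float) -> float:
    """Ratio of checkpoints taken with proactive FT to checkpoints without it.

    Assumption: if a fraction p of faults is avoided by migration, the
    reactive scheme sees an effective MTBF of M / (1 - p). The interval
    then grows by sqrt(1 / (1 - p)), so the number of checkpoints over a
    fixed run time shrinks by sqrt(1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical sample values: a 2 s checkpoint and a 100 s MTBF.
baseline = young_interval(2.0, 100.0)                 # interval without proactive FT
with_proactive = young_interval(2.0, 100.0 / 0.3)     # 70% of faults migrated away
ratio = checkpoint_count_ratio(0.7)                   # roughly 0.55
```

Under these assumptions, handling 70% of faults proactively leaves about 55% of the original checkpoints, which is consistent with "nearly cutting the number of checkpoints in half".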