Proactive process-level live migration and back migration in HPC environments

Authors:
Chao Wang;Frank Mueller;Christian Engelmann;Stephen L. Scott
Affiliations:
Department of Computer Science, North Carolina State University, Raleigh, NC 27695-7534, United States;Department of Computer Science, North Carolina State University, Raleigh, NC 27695-7534, United States;Oak Ridge National Laboratory, Computational Sciences and Mathematics Division, Oak Ridge, TN 37831, United States;Oak Ridge National Laboratory, Computational Sciences and Mathematics Division, Oak Ridge, TN 37831, United States
Venue:
Journal of Parallel and Distributed Computing
Year:
2012


Abstract

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale because of its massive I/O requirements, and it relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism lets applications continue executing during much of the process migration. This scheme is integrated into an MPI execution environment to transparently survive node failures caused by deteriorating health, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, whereas comparable operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by cutting the number of checkpoints nearly in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit of back migration.
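The claim that handling 70% of faults proactively nearly halves the number of checkpoints can be checked with a rough calculation. The sketch below assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF), and assumes that only the faults not handled proactively still need checkpoint-based recovery. The abstract does not state which model was used, so both assumptions are illustrative, and the checkpoint cost and MTBF values are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Checkpoint count with proactive FT divided by the count without it.

    Assumption: only unpredicted faults require rollback, so the effective
    MTBF seen by reactive FT grows by 1 / (1 - p). The interval then grows
    by the square root of that factor, and the count shrinks by sqrt(1 - p).
    """
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical values, chosen only for illustration.
checkpoint_cost_s = 120.0      # time to write one checkpoint
mtbf_s = 6 * 3600.0            # system mean time between failures
proactive_fraction = 0.7       # share of faults handled by live migration

base = young_interval(checkpoint_cost_s, mtbf_s)
improved = young_interval(checkpoint_cost_s, mtbf_s / (1.0 - proactive_fraction))

print(f"interval without proactive FT: {base:.0f} s")
print(f"interval with proactive FT:    {improved:.0f} s")
print(f"checkpoint count ratio:        {checkpoint_ratio(proactive_fraction):.3f}")
```

Under these assumptions the ratio is sqrt(0.3), about 0.55, and it does not depend on the chosen checkpoint cost or MTBF. That is a reduction of roughly 45% in the number of checkpoints, consistent with "nearly in half".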