As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale because of its massive I/O requirements, and it relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism lets applications continue executing during much of the process migration. This scheme is integrated into an MPI execution environment to transparently withstand health-related node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, while comparable operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by cutting the number of checkpoints nearly in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit of back migration.
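One way to see why handling 70% of faults proactively could nearly halve the number of checkpoints is Young's first-order approximation for the checkpoint interval, T = sqrt(2 * C * MTBF). The sketch below is an illustration under that assumption, not necessarily the derivation the authors use; the function names, the checkpoint cost, and the MTBF value are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Relative number of checkpoints once a fraction of faults is avoided.

    Assumption: faults handled proactively no longer count against the MTBF
    seen by reactive FT, so the effective MTBF grows by 1 / (1 - p). The
    interval grows by the square root of that factor, and the number of
    checkpoints over a fixed run shrinks by the inverse, sqrt(1 - p).
    """
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical inputs, chosen only for illustration.
cost_s = 60.0            # time to write one checkpoint
mtbf_s = 6.0 * 3600.0    # mean time between failures without proactive FT
p = 0.7                  # fraction of faults handled proactively

base = young_interval(cost_s, mtbf_s)
improved = young_interval(cost_s, mtbf_s / (1.0 - p))

print(f"interval grows by a factor of {improved / base:.3f}")
print(f"checkpoints shrink to {checkpoint_ratio(p):.1%} of the original")
```

Under these assumptions the ratio depends only on the proactively handled fraction, not on the checkpoint cost or the MTBF: sqrt(1 - 0.7) is about 0.55, which is in line with "nearly in half".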