Fine-grained mobility in the Emerald system
ACM Transactions on Computer Systems (TOCS)
Transparent process migration: design alternatives and the sprite implementation
Software—Practice & Experience
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Preemptable remote execution facilities for the V-system
Proceedings of the tenth ACM symposium on Operating systems principles
Architectural requirements and scalability of the NAS parallel benchmarks
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
A first order approximation to the optimum checkpoint interval
Communications of the ACM
ACM Computing Surveys (CSUR)
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Terrestrial-Based Radiation Upsets: A Cautionary Tale
FCCM '05 Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Process Migration for MPI Applications based on Coordinated Checkpoint
ICPADS '05 Proceedings of the 11th International Conference on Parallel and Distributed Systems - Volume 01
A Power-Aware Run-Time System for High-Performance Computing
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
An Agent Oriented Proactive Fault-Tolerant Framework for Grid Computing
E-SCIENCE '05 Proceedings of the First International Conference on e-Science and Grid Computing
ISPDC '05 Proceedings of the The 4th International Symposium on Parallel and Distributed Computing
Availability Modeling and Analysis on High Performance Cluster Computing Systems
ARES '06 Proceedings of the First International Conference on Availability, Reliability and Security
MPI-Mitten: Enabling Migration Technology in MPI
CCGRID '06 Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid
Live migration of virtual machines
NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
Dynamic Scheduling with Process Migration
CCGRID '07 Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid
Optimizing network virtualization in Xen
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
High performance VMM-bypass I/O in virtual machines
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Proactive fault tolerance for HPC with Xen virtualization
Proceedings of the 21st annual international conference on Supercomputing
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
A Meta-Learning Failure Predictor for Blue Gene/L Systems
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
A Framework for Proactive Fault Tolerance
ARES '08 Proceedings of the 2008 Third International Conference on Availability, Reliability and Security
Toward Predictive Failure Management for Distributed Stream Processing Systems
ICDCS '08 Proceedings of the 2008 The 28th International Conference on Distributed Computing Systems
Evaluation of fault-tolerant policies using simulation
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
Proactive fault tolerance in MPI applications via task migration
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
A tunable holistic resiliency approach for high-performance computing systems
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
International Journal of High Performance Computing Applications
International Journal of High Performance Computing Applications
Journal of Parallel and Distributed Computing
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Compiler-support for robust multi-core computing
ISoLA'10 Proceedings of the 4th international conference on Leveraging applications of formal methods, verification, and validation - Volume Part I
Rethink the virtual machine template
Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching
Proceedings of the 8th ACM International Conference on Computing Frontiers
A resiliency model for high performance infrastructure based on logical encapsulation
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Transparent Accelerator Migration in a Virtualized GPU Environment
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Optimizing datacenter power with memory system levers for guaranteed quality-of-service
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Fault prediction under the microscope: a closer look into HPC systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic resource-centric process migration for MPI
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
Performance comparison under failures of MPI and MapReduce: An analytical approach
Future Generation Computer Systems
Post-failure recovery of MPI communication capability: Design and rationale
International Journal of High Performance Computing Applications
Future Generation Computer Systems
X10-FT: Transparent fault tolerance for APGAS language and runtime
Parallel Computing
Hi-index | 0.00 |
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of processes migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.