Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
CLIP: a checkpointing tool for message-passing parallel programs
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
An Efficient and Transparent Thread Migration Scheme in the PM2 Runtime System
Proceedings of the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid
Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Proactive Recovery in Distributed CORBA Applications
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Fault Tolerance in Message Passing Interface Programs
International Journal of High Performance Computing Applications
Building and Using a Fault-Tolerant MPI Implementation
International Journal of High Performance Computing Applications
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Scaling molecular dynamics to 3000 processors with projections: a performance analysis case study
ICCS'03 Proceedings of the 2003 international conference on Computational science
Scalable cosmological simulations on parallel machines
VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
Proactive fault tolerance for HPC with Xen virtualization
Proceedings of the 21st annual international conference on Supercomputing
Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
International Journal of High Performance Computing Applications
International Journal of High Performance Computing Applications
A scalable asynchronous replication-based strategy for fault tolerant MPI applications
HiPC'07 Proceedings of the 14th international conference on High performance computing
Managing performance of aging applications via synchronized replica rejuvenation
DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
Performance evaluation of bag of gangs scheduling in a heterogeneous distributed system
Journal of Systems and Software
Optimized pre-copy live migration for memory intensive applications
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Analyzing fault aware collective performance in a process fault tolerant MPI
Parallel Computing
Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing
Data-driven fault tolerance for work stealing computations
Proceedings of the 26th ACM international conference on Supercomputing
Evaluating operating system vulnerability to memory errors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Post-failure recovery of MPI communication capability: Design and rationale
International Journal of High Performance Computing Applications
Hi-index | 0.00 |
Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.