Proactive fault tolerance in MPI applications via task migration

Authors:
Sayantan Chakravorty;Celso L. Mendes;Laxmikant V. Kalé
Affiliations:
Department of Computer Science, University of Illinois at Urbana-Champaign;Department of Computer Science, University of Illinois at Urbana-Champaign;Department of Computer Science, University of Illinois at Urbana-Champaign
Venue:
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Year:
2006

Abstract

Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.
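
As a rough illustration of the idea of moving tasks away from a processor that is expected to fail, the following minimal sketch reassigns every task on a warned processor to the least-loaded surviving processor. It is not the Charm++ or AMPI API and is not taken from the paper: the function name `evacuate`, the dictionary-based task map, the per-task load estimates, and the greedy heaviest-first policy are all assumptions made for this example.

```python
import heapq


def evacuate(placement, task_load, processors, doomed):
    """Return a new task -> processor map with no task left on `doomed`.

    All names here are hypothetical, chosen for illustration only.
    placement:  dict mapping task id -> processor id
    task_load:  dict mapping task id -> estimated cost of that task
    processors: iterable of all processor ids
    doomed:     processor id for which a failure warning was received
    """
    survivors = [p for p in processors if p != doomed]
    if not survivors:
        raise ValueError("no surviving processor to migrate to")

    # Load already present on each surviving processor.
    load = {p: 0.0 for p in survivors}
    for task, proc in placement.items():
        if proc != doomed:
            load[proc] += task_load[task]

    # Min-heap keyed on load, so the least-loaded survivor is popped first.
    heap = [(l, p) for p, l in load.items()]
    heapq.heapify(heap)

    new_placement = dict(placement)
    # Assumed policy: place the heaviest evacuated tasks first.
    evacuees = sorted(
        (t for t, p in placement.items() if p == doomed),
        key=lambda t: task_load[t],
        reverse=True,
    )
    for task in evacuees:
        l, p = heapq.heappop(heap)
        new_placement[task] = p
        heapq.heappush(heap, (l + task_load[task], p))
    return new_placement


if __name__ == "__main__":
    placement = {"a": 0, "b": 0, "c": 1, "d": 2}
    task_load = {"a": 3, "b": 1, "c": 2, "d": 1}
    print(evacuate(placement, task_load, [0, 1, 2], doomed=0))
```

The sketch covers only the placement decision. Transferring task state and redirecting messages, which the abstract attributes to the processor virtualization and dynamic task migration provided by Charm++ and AMPI, are outside its scope.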