Post-failure recovery of MPI communication capability: Design and rationale

Authors:
Wesley Bland; Aurelien Bouteiller; Thomas Herault; George Bosilca; Jack Dongarra
Affiliation (all authors):
Innovative Computing Laboratory, University of Tennessee, Knoxville, TN, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2013

Abstract

As supercomputers enter an era of massive parallelism in which the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequences of failures for MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires considering MPI simultaneously from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.