Analyzing fault aware collective performance in a process fault tolerant MPI

Authors:
Joshua Hursey;Richard L. Graham
Affiliations:
Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA;Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
Venue:
Parallel Computing
Year:
2012

Abstract

Application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what traditional techniques alone can provide. To keep using High Performance Computing (HPC) systems efficiently in the presence of process failure, applications will depend on libraries that sustain failure-free performance after processes fail. Optimized Message Passing Interface (MPI) collective operations are a critical component of many scalable HPC applications. However, most collective algorithms cannot handle process failure. Next generation MPI implementations must provide fault aware versions of such algorithms that can sustain performance across process failure. This paper discusses the design and implementation of fault aware collective algorithms for tree structured communication patterns. Three design approaches (rerouting, lookup avoiding, and rebalancing) are described and analyzed for their performance impact relative to similar fault unaware barrier and broadcast collective algorithms. The analysis shows that the rerouting approach causes significant performance degradation, while the rebalancing approach can bring performance to within 1% of the fault unaware performance. This paper also presents the impact of the run-through stabilization prototype on point-to-point communication, and analyzes the time to rebalance the tree while accounting for process failures.
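To make the difference between rerouting and rebalancing concrete, the following is a minimal sketch and not the paper's implementation. It assumes a plain binary tree rooted at rank 0 (the paper's actual tree shapes are not given in the abstract), a root that stays alive, and a set of failed ranks already known to every process. All function names are hypothetical. The lookup avoiding approach is not sketched here.

```python
# Illustrative sketch only: binary tree over ranks 0..size-1, rooted at rank 0.
# Assumptions (not from the paper): tree shape, alive root, known failed set.

def parent(rank):
    """Parent of `rank` in the fault unaware binary tree (None for the root)."""
    return None if rank == 0 else (rank - 1) // 2

def children(rank, size):
    """Children of `rank` in the fault unaware binary tree."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]

def rerouted_parent(rank, failed):
    """Rerouting: skip failed ancestors until an alive one is found."""
    p = parent(rank)
    while p is not None and p in failed:
        p = parent(p)
    return p

def rerouted_children(rank, size, failed):
    """Rerouting: a failed child is replaced by its nearest alive descendants."""
    result = []
    stack = children(rank, size)[::-1]
    while stack:
        c = stack.pop()
        if c in failed:
            # Adopt the failed child's children in its place.
            stack.extend(children(c, size)[::-1])
        else:
            result.append(c)
    return result

def rebalanced_tree(size, failed):
    """Rebalancing: rebuild a binary tree over the alive ranks only."""
    alive = [r for r in range(size) if r not in failed]
    tree = {}
    for i, r in enumerate(alive):
        tree[r] = [alive[j] for j in (2 * i + 1, 2 * i + 2) if j < len(alive)]
    return tree

if __name__ == "__main__":
    failed = {1}
    print(rerouted_children(0, 7, failed))  # rank 0 adopts rank 1's children
    print(rebalanced_tree(7, failed))       # every alive rank has <= 2 children
```

With seven ranks and rank 1 failed, rerouting leaves rank 0 with three children instead of two, so the fan-out grows wherever failures occur. Rebalancing restores a fan-out of at most two at the cost of recomputing the tree. This is consistent with the abstract's observation that rerouting degrades performance while rebalancing stays close to the fault unaware algorithms.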