Scalable fault tolerant MPI: extending the recovery algorithm

Authors:
Graham E. Fagg; Thara Angskun; George Bosilca; Jelena Pjesivac-Grbovic; Jack J. Dongarra
Affiliations:
Dept. of Computer Science, The University of Tennessee, Knoxville, TN (all authors)
Venue:
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Year:
2005

References: 10
Cited by: 6

MagPIe: MPI's collective communication operations for clustered wide area systems
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Harness: a next generation distributed virtual machine
Future Generation Computer Systems - Special issue on metacomputing

Scalable networked information processing environment (SNIPE)
Future Generation Computer Systems - Special issue on metacomputing

A network-failure-tolerant message-passing system for terascale clusters
ICS '02 Proceedings of the 16th international conference on Supercomputing

HARNESS and fault tolerant MPI
Parallel Computing - Clusters and computational grids for scientific computing

Distributed Systems: Principles and Paradigms

CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium

MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing

MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid

Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing

Optimisation of the execution time inspired in Cross Layer design using effective load balancing in a LAN-WLAN environment
International Journal of Computational Science and Engineering

Towards Efficient MapReduce Using MPI
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface

Analyzing fault aware collective performance in a process fault tolerant MPI
Parallel Computing

FT-MPI, fault-tolerant metacomputing and generic name services: a case study
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface

A bluetooth MPI framework for collaborative computer graphics
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications

Evaluating operating system vulnerability to memory errors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers

Abstract

Fault Tolerant MPI (FT-MPI) [6] was designed to give applications several methods for handling process failures beyond simple checkpoint-restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation, although robust, were very conservative, and this affected scalability both on very large clusters and on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that aim to be both scalable and latency tolerant. Our conclusions show that combining topology-aware collective communication with distributed consensus algorithms produces the best results.
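
The abstract does not give the recovery algorithm itself, so the following is only an illustrative sketch of the general idea it names: surviving processes agree on which ranks have failed, and the communicator membership is rebuilt from that agreement. It is not the FT-MPI algorithm. The function names (`agree_on_failures`, `rebuild_membership`), the two-stage cluster grouping, and the example data are all assumptions made for illustration, and the "consensus" here is a plain set union simulated in one process rather than a real distributed protocol.

```python
from typing import Dict, List, Set


def agree_on_failures(
    local_views: Dict[int, Set[int]],
    clusters: List[List[int]],
) -> Set[int]:
    """Return the union of failure suspicions reported by live ranks.

    local_views maps each live rank to the set of ranks it suspects failed.
    clusters groups ranks by site, so the merge happens in two stages:
    inside each cluster first, then once across clusters. This loosely
    mimics a topology-aware collective (hypothetical layout).
    """
    cluster_results: List[Set[int]] = []
    for members in clusters:
        merged: Set[int] = set()
        for rank in members:
            # Only live ranks contribute a view; failed ranks are skipped.
            if rank in local_views:
                merged |= local_views[rank]
        cluster_results.append(merged)

    # Stage 2: a single merge across clusters (the wide-area step).
    agreed: Set[int] = set()
    for partial in cluster_results:
        agreed |= partial
    return agreed


def rebuild_membership(size: int, failed: Set[int]) -> List[int]:
    """List the surviving ranks of a communicator of the given size, in order."""
    return [rank for rank in range(size) if rank not in failed]


# Hypothetical example: 6 ranks in two clusters; ranks 2 and 5 have failed,
# and the survivors hold different partial views of those failures.
views = {0: {2}, 1: set(), 3: {5}, 4: {2, 5}}
clusters = [[0, 1, 2], [3, 4, 5]]
failed = agree_on_failures(views, clusters)
survivors = rebuild_membership(6, failed)
```

Merging inside each cluster before merging across clusters keeps most of the exchange on local links and limits the wide-area step to one partial result per cluster, which is the intuition behind pairing topology awareness with agreement on membership.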