Fault Tolerant MPI for the HARNESS Meta-computing System

Authors:
Graham E. Fagg; Antonin Bukovsky; Jack Dongarra
Venue:
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Year:
2001

Abstract

Initial versions of MPI were designed to work efficiently on multiprocessors that had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model suitable for use on clusters or distributed systems would have reduced their performance. As current HPC collaborative applications increase in size and distribution, the potential for node and network failures increases, and the need arises for new fault-tolerant systems to be developed. Here we present a new implementation of MPI called FT-MPI that allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, and example applications, and discuss some performance issues such as efficient group communications and complex data handling.