HARNESS fault tolerant MPI design, usage and performance issues

  • Authors:
  • Graham E. Fagg; Jack J. Dongarra

  • Affiliations:
  • High Performance Computing Center Stuttgart (HLRS), Parallel and Distributed Systems, Allmandring 30, D-70550 Stuttgart, Germany; Department of Computer Science, Suite 413, 1122 Volunteer Blvd., University of Tennessee, Knoxville, TN

  • Venue:
  • Future Generation Computer Systems - Grid computing: Towards a new computing infrastructure

  • Year:
  • 2002

Abstract

Initial versions of MPI were designed to work efficiently on multiprocessors that had very little job control and thus static process models. Forcing them to support a dynamic process model suitable for clusters or distributed systems would have reduced their performance. As current HPC collaborative applications increase in size and distribution, the potential for node and network failures also increases. This is especially true when MPI implementations are used as the communication medium for Grid applications, where the Grid architectures themselves are inherently unreliable, requiring new fault-tolerant MPI systems to be developed. Here we present a new implementation of MPI, called FT-MPI, that allows the semantics and associated modes of failure to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics and design, example applications, and performance issues such as efficient group communication and complex data handling. We also briefly describe the HARNESS g_hcore system, which handles low-level system operations on behalf of the MPI implementation, including the plug-in services developed and their interaction with the FT-MPI runtime library.
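
To make the failure-handling model concrete, the sketch below shows the style of application-controlled recovery the abstract describes. It uses only standard MPI calls that exist in any MPI implementation; the comments about what FT-MPI adds (a rank failure surfacing as an error code rather than an abort, with the communicator recoverable afterwards) reflect the semantics claimed in the abstract, while the specific recovery step is left hypothetical since the exact FT-MPI API is not reproduced here.

/*
 * Hedged sketch: application-level fault handling in the FT-MPI style.
 * Only standard MPI calls are used; surviving a peer failure at the
 * marked branch is what FT-MPI provides beyond conventional MPI.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, rc;
    double local = 1.0, total = 0.0;

    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting, so the
       application can react to a failure itself. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    rc = MPI_Allreduce(&local, &total, 1, MPI_DOUBLE,
                       MPI_SUM, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Under FT-MPI, a node failure surfaces here as an error code;
           the application would then apply its chosen failure mode
           (e.g. rebuilding the communicator, a hypothetical step not
           shown) and re-run the collective from a consistent state. */
        fprintf(stderr, "rank %d of %d: collective failed, recovering\n",
                rank, size);
    }

    MPI_Finalize();
    return 0;
}

The design point this illustrates: in standard MPI the state of the library after a process failure is undefined even with MPI_ERRORS_RETURN set, so the error branch above cannot be relied upon; FT-MPI's contribution is to define that state and let the application select among failure modes via its modified API.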