MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing

Authors:
Rajanikanth Batchu;Anthony Skjellum;Zhenqian Cui;Murali Beddhu;Jothi P. Neelamegam;Yoginder Dandass;Manoj Apte
Venue:
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid
Year:
2001

Citing 0
Cited 22

MPICH-V: toward a scalable fault tolerant MPI for volatile nodes

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
The Effects of an ARMOR-Based SIFT Environment on the Performance and Dependability of User Applications

IEEE Transactions on Software Engineering
MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cluster Computing
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17
Fault Tolerance in Message Passing Interface Programs

International Journal of High Performance Computing Applications
A Simple MPI Process Swapping Architecture for Iterative Applications

International Journal of High Performance Computing Applications
A channel memory based fault tolerance for MPI applications

Future Generation Computer Systems - Special issue: Parallel computing technologies
Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
HPC-Colony: services and interfaces for very large systems

ACM SIGOPS Operating Systems Review
A robust framework for real-time distributed processing of satellite data

Journal of Parallel and Distributed Computing
HeteroMPI: Towards a message-passing library for heterogeneous networks of computers

Journal of Parallel and Distributed Computing
Fault tolerant algorithms for heat transfer problems

Journal of Parallel and Distributed Computing
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes

Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
A High-Level Interpreted MPI Library for Parallel Computing in Volunteer Environments

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
A Robust and Efficient Message Passing Library for Volunteer Computing Environments

Journal of Grid Computing
Proactive fault tolerance in MPI applications via task migration

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
An intelligent management of fault tolerance in cluster using RADICMPI

EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Scalable fault tolerant MPI: extending the recovery algorithm

PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
A communication framework for fault-tolerant parallel execution

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Fault-tolerant parallel applications with dynamic parallel schedules: a programmer's perspective

Dependable Systems
Estimation of MPI application performance on volunteer environments

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing

Abstract

MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware; these environments include space-based systems, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features. Non-portable code for event handlers and recovery management is isolated. User-coordinated recovery, checkpointing, transparency, and event handling, as well as evolvability of legacy MPI codes, form key design criteria. Parallel self-checking threads address four levels of MPI implementation robustness, three of which are portable to any multi-threaded MPI. A taxonomy of application types provides six initial fault-relevant models; user-transparent parallel nMR computation is thereby considered. Key concepts from MPI/RT (real-time MPI) are also incorporated into MPI/FT, with further overt support for MPI/RT and MPI/FT in applications possible in the future.
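
The abstract mentions parallel nMR computation without expanding the term. Assuming nMR denotes n-modular redundancy (the same computation run on n replicas, with the results compared), the voting step might be sketched as below. This is an illustrative sketch only, not MPI/FT's interface: the function name `nmr_vote` and the strict-majority rule are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def nmr_vote(replica_results: Sequence[Hashable]) -> Optional[Hashable]:
    """Hypothetical voter: return the value reported by a strict majority
    of replicas, or None when no value reaches a majority."""
    if not replica_results:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(replica_results).most_common(1)[0]
    # Strict majority: more than half of all replicas must agree.
    if count > len(replica_results) // 2:
        return value
    return None


if __name__ == "__main__":
    # Three replicas, one of which returned a corrupted result.
    print(nmr_vote([42, 42, 7]))
    # No two replicas agree, so the fault cannot be masked.
    print(nmr_vote([1, 2, 3]))
```

With three replicas, a single faulty result is outvoted by the two that agree. When no majority exists the voter returns `None`, so the caller can fall back to another recovery path, such as restarting from a checkpoint.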