A network-failure-tolerant message-passing system for terascale clusters

Authors:
Richard L. Graham; Sung-Eun Choi; David J. Daniel; Nehal N. Desai; Ronald G. Minnich; Craig E. Rasmussen; L. Dean Risinger; Mitchel W. Sukalski
Affiliations:
Los Alamos National Laboratory, Advanced Computing Laboratory, MS-B287, Los Alamos, New Mexico (all authors)
Venue:
International Journal of Parallel Programming
Year:
2003

Abstract

The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures, including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface and reliable message transmission over multiple network paths and routes between a given source and destination. Performance measurements on production-grade platforms are also presented.
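
The abstract names end-to-end failure tolerance and transmission over multiple paths but does not describe the mechanism. The following is a minimal sketch of the general idea only, assuming a CRC-32 trailer for end-to-end verification and round-robin retry across paths. It is not LA-MPI code: the names `frame`, `verify`, `send_reliably`, and the three toy paths are hypothetical, and a "path" is modeled as a plain function that returns whatever bytes reach the receiver.

```python
import zlib
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical model: a path takes the framed bytes and returns what the
# receiver saw (possibly corrupted), or None if nothing arrived.
Path = Callable[[bytes], Optional[bytes]]


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload so the receiver can check it end to end."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(framed: Optional[bytes]) -> Optional[bytes]:
    """Return the payload if the trailing CRC-32 matches, otherwise None."""
    if framed is None or len(framed) < 4:
        return None
    payload, trailer = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload


def send_reliably(
    payload: bytes, paths: Sequence[Path], max_attempts: int = 4
) -> Tuple[bytes, int]:
    """Retry over the paths in round-robin order until one delivers an intact
    frame. Returns the delivered payload and the index of the path used."""
    framed = frame(payload)
    for attempt in range(max_attempts):
        index = attempt % len(paths)
        received = verify(paths[index](framed))
        if received is not None:
            return received, index
    raise RuntimeError("no path delivered an intact message")


def flaky_path(framed: bytes) -> Optional[bytes]:
    """Toy path that flips one bit, standing in for a transmission error."""
    corrupted = bytearray(framed)
    corrupted[0] ^= 0x01
    return bytes(corrupted)


def dead_path(framed: bytes) -> Optional[bytes]:
    """Toy path that drops everything, standing in for a failed interface."""
    return None


def good_path(framed: bytes) -> Optional[bytes]:
    """Toy path that delivers the frame unchanged."""
    return framed
```

For example, `send_reliably(b"hello", [flaky_path, dead_path, good_path])` rejects the corrupted frame from the first path, gets nothing from the second, and accepts the delivery from the third. Performing the check at the endpoints rather than per link means that errors introduced anywhere between sender and receiver are caught by the same test.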