The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures, including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface and reliable message transmission over multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.