Network Fault Tolerance in Open MPI

Authors:
Galen M. Shipman;Richard L. Graham;George Bosilca
Affiliations:
Advanced Computing Laboratory, Los Alamos National Laboratory;National Center for Computational Sciences, Oak Ridge National Laboratory;University of Tennessee, Dept. of Computer Science
Venue:
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Year:
2007

Abstract

High Performance Computing (HPC) systems are rapidly growing in size and complexity. As a result, transient and persistent network failures can occur on the time scale of application run times, reducing the productive utilization of these systems. The ubiquitous network protocol used to deal with such failures is TCP/IP; however, available implementations of this protocol provide unacceptable performance for HPC system users and do not deliver the high-bandwidth, low-latency communications of modern interconnects. This paper describes methods used to protect against several network errors, such as dropped packets, corrupt packets, and loss of network interfaces, while maintaining high-performance communications. Microbenchmark experiments using vendor-supplied TCP/IP and O/S-bypass low-level communication stacks over InfiniBand and Myrinet demonstrate the high-performance characteristics of our protocol. The NAS Parallel Benchmarks demonstrate the scalability and the minimal performance impact of this protocol. Communication-level microbenchmarks show that providing higher data reliability decreases bandwidth by up to 30% relative to unprotected communications, yet still improves performance by a factor of four over TCP/IP running over InfiniBand DDR. In addition, application-level benchmarks (communication/computation) show virtually no impact of the data reliability protocol on overall run time.
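
The three failure classes named in the abstract (dropped packets, corrupt packets, and loss of a network interface) can be illustrated with a small, generic sketch. It is not the protocol described in the paper or any Open MPI code. It assumes a common textbook scheme: each message carries a CRC32 checksum, a frame that is lost or fails verification is retransmitted, and a dead interface triggers failover to the next one. All names (`Frame`, `ScriptedLink`, `reliable_send`) are hypothetical, and the "network" is a scripted in-memory simulation.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Frame:
    seq: int        # sequence number identifying the message
    payload: bytes  # user data (assumed non-empty in this sketch)
    checksum: int   # CRC32 of the payload, computed by the sender


def make_frame(seq: int, payload: bytes) -> Frame:
    """Build a frame and attach a checksum over its payload."""
    return Frame(seq, payload, zlib.crc32(payload))


def frame_is_valid(frame: Frame) -> bool:
    """Receiver-side check: recompute the CRC32 and compare."""
    return zlib.crc32(frame.payload) == frame.checksum


class ScriptedLink:
    """Hypothetical network interface whose faults follow a fixed script.

    Each send consumes one script entry: "ok", "drop", "corrupt", or "dead".
    Once the script is exhausted, every send succeeds (unless the link died).
    """

    def __init__(self, name, script=()):
        self.name = name
        self.script = list(script)
        self.alive = True

    def send(self, frame):
        """Return the frame as the receiver sees it, or None if it was lost."""
        event = self.script.pop(0) if self.script else "ok"
        if event == "dead":
            self.alive = False
        if not self.alive or event == "drop":
            return None
        if event == "corrupt":
            # Flip one bit of the first payload byte; the checksum is unchanged.
            damaged = bytes([frame.payload[0] ^ 0x01]) + frame.payload[1:]
            return Frame(frame.seq, damaged, frame.checksum)
        return frame


def reliable_send(payload, seq, links, max_attempts=8):
    """Deliver payload, retransmitting and failing over as needed.

    Returns (delivered_payload, link_name, attempts).
    """
    frame = make_frame(seq, payload)
    attempts = 0
    for link in links:  # failover order is simply the list order
        while link.alive and attempts < max_attempts:
            attempts += 1
            received = link.send(frame)
            # A lost or checksum-failing frame is treated as unacknowledged,
            # so the loop retransmits the same frame.
            if received is not None and frame_is_valid(received):
                return received.payload, link.name, attempts
    raise ConnectionError("no usable link delivered the message")
```

For example, with a first link scripted as `["drop", "corrupt", "dead"]` and a healthy second link, `reliable_send(b"halo", 0, links)` retransmits twice on the first link, detects that it has died, and delivers the payload intact over the second link on the fourth attempt. A real implementation would also need acknowledgements, timeouts, and duplicate suppression at the receiver, which this sketch omits.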