A network-failure-tolerant message-passing system for terascale clusters

  • Authors:
  • Richard L. Graham; Sung-Eun Choi; David J. Daniel; Nehal N. Desai; Ronald G. Minnich; Craig E. Rasmussen; L. Dean Risinger; Mitchel W. Sukalski

  • Affiliations:
  • Los Alamos National Laboratory, Advanced Computing Laboratory, MS-B287, Los Alamos, New Mexico (all authors)

  • Venue:
  • International Journal of Parallel Programming
  • Year:
  • 2003

Abstract

The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures, including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interfaces, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.