SHIELD: a fault-tolerant MPI for an infiniband cluster

  • Authors:
  • Hyuck Han; Hyungsoo Jung; Jai Wug Kim; Jongpil Lee; Youngjin Yu; Shin Gyu Kim; Heon Y. Yeom

  • Affiliations:
  • School of Computer Science and Engineering, Seoul National University, Seoul, South Korea (all authors)

  • Venue:
  • HPCC'06: Proceedings of the Second International Conference on High Performance Computing and Communications
  • Year:
  • 2006

Abstract

Today's high-performance cluster computing technologies demand extreme robustness against unexpected failures in order to finish aggressively parallelized work within a given time constraint. Although there has been a steady effort to develop hardware and software tools that increase the fault resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper presents SHIELD, a practical and easily deployable fault-tolerant MPI and MPI management system for an InfiniBand cluster. SHIELD provides a novel framework that can be easily used in real cluster systems, and its design perspectives differ from those of other fault-tolerant MPI implementations. We show that SHIELD provides robust fault resilience to fault-vulnerable cluster systems and that its design features are useful wherever fault resilience is a matter of utmost importance.