Self-healing network for scalable fault-tolerant runtime environments

  • Authors:
  • Thara Angskun; Graham Fagg; George Bosilca; Jelena Pješivac-Grbović; Jack Dongarra

  • Affiliations:
  • Department of Computer Science, The University of Tennessee, 1122 Volunteer Blvd. Knoxville, TN 37996, USA (all authors)

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2010

Abstract

The number of processors in high-performance computing platforms keeps growing to satisfy users' desire to solve larger and more complex problems. Scalable, fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware, which demand a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures, and it recovers automatically after a failure occurs. The SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). Experimental results show that both of the latest routing algorithms used in the SHN, multicast and broadcast, are faster and more reliable than the original SFTP routing algorithms.