Poster: scalable infrastructure to support supercomputer resiliency-aware applications and load balancing

  • Authors:
  • Yoav Tock;Benjamin Mandler;José Moreira;Terry Jones

  • Affiliations:
  • IBM Haifa Research Lab, Haifa, Israel;IBM Haifa Research Lab, Haifa, Israel;IBM T. J. Watson Research Center, Yorktown Heights, NY, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA

  • Venue:
  • Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

High performance computing systems display increasing complexity and component counts. This trend exposes weaknesses in the underlying clustering infrastructure needed for continuous availability, maximizing utilization, and efficient administration of such systems. To mitigate the problem, we present a highly scalable clustering infrastructure, based on peer-to-peer technologies, for supporting resiliency-aware applications as well as efficient monitoring and load balancing. Supported services include Membership, Publish-subscribe messaging, Convergecast, Attribute replication and a DHT. We present a preliminary evaluation taken from an IBM BlueGene/P, demonstrating scalability up to ~ 256K nodes.