System Management in the BlueGene/L Supercomputer

  • Authors:
  • G. Almasi, L. Bachega, R. Bellofatto, J. Brunheroto, C. Cascaval, J. Castaños, P. Crumley, C. Erway, J. Gagliano, D. Lieber, P. Mindlin, J. E. Moreira, R. K. Sahoo, A. Sanomiya, E. Schenfeld, R. Swetz, M. Bae, G. Laib, K. Ranganathan, Y. Aridor, T. Domany, Y. Gal, O. Goldshmidt, E. Shmueli


  • Venue:
  • IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
  • Year:
  • 2003


Abstract

With 65,536 compute nodes, the BlueGene/L supercomputer represents a new level of scalability for parallel systems. In this paper, we discuss system management and control for BlueGene/L, including machine booting, software installation, user account management, system monitoring, and job execution. We address the issue of scalability by organizing the system hierarchically. The 65,536 compute nodes are organized into 1,024 clusters of 64 compute nodes each, called processing sets. Each processing set is under the control of a 65th node, called an I/O node. The 1,024 processing sets can then be managed to a great extent as a regular Linux cluster. Regular cluster management is complemented by BlueGene/L-specific services, performed by a service node over a separate control network.
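The hierarchy described in the abstract can be sketched with a few lines of arithmetic; the figures come directly from the paper, while the variable names are illustrative only:

```python
# Sketch of the BlueGene/L management hierarchy described in the abstract.
# Node counts are from the paper; the helper names are illustrative.
COMPUTE_NODES = 65_536
PSET_SIZE = 64  # compute nodes per processing set

# One processing set = 64 compute nodes + 1 controlling I/O node.
processing_sets = COMPUTE_NODES // PSET_SIZE
io_nodes = processing_sets  # one I/O node per processing set

print(processing_sets)              # 1024 processing sets
print(COMPUTE_NODES + io_nodes)     # 66560 nodes in total (compute + I/O)
```

From the management software's perspective, only the 1,024 I/O nodes need to be addressed directly, which is why the machine can largely be administered as an ordinary Linux cluster of that size.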