System Management in the BlueGene/L Supercomputer

Authors:
G. Almasi;L. Bachega;R. Bellofatto;J. Brunheroto;C. Cascaval;J. Castaños;P. Crumley;C. Erway;J. Gagliano;D. Lieber;P. Mindlin;J. E. Moreira;R. K. Sahoo;A. Sanomiya;E. Schenfeld;R. Swetz;M. Bae;G. Laib;K. Ranganathan;Y. Aridor;T. Domany;Y. Gal;O. Goldshmidt;E. Shmueli
Venue:
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Year:
2003

Citing 0
Cited 6

The Cluster Monitoring & Controlling Method with Scalable Communication Framework
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region

Multitoroidal Interconnects For Tightly Coupled Supercomputers
IEEE Transactions on Parallel and Distributed Systems

Lossless compression for large scale cluster logs
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

A proactive fault-detection mechanism in large-scale cluster systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Event-driven configuration of a neural network CMP system over an homogeneous interconnect fabric
Parallel Computing

Open job management architecture for the Blue Gene/L supercomputer
JSSPP'05 Proceedings of the 11th international conference on Job Scheduling Strategies for Parallel Processing

Abstract

With 65,536 compute nodes, the BlueGene/L supercomputer represents a new level of scalability for parallel systems. In this paper, we discuss system management and control for BlueGene/L, including machine booting, software installation, user account management, system monitoring, and job execution. We address the issue of scalability by organizing the system hierarchically. The 65,536 compute nodes are organized in 1,024 clusters of 64 compute nodes each, called processing sets. Each processing set is under the control of a 65th node, called an I/O node. The 1,024 processing sets can then be managed to a great extent as a regular Linux cluster. Regular cluster management is complemented by BlueGene/L-specific services, performed by a service node over a separate control network.
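
The hierarchy described in the abstract can be checked with a little arithmetic. The Python sketch below is illustrative only: the sizes (65,536 compute nodes, 64 per processing set, one I/O node per set) come from the abstract, while the linear node numbering and the function names `pset_of` and `members_of` are assumptions made for this example and do not reflect BlueGene/L's actual addressing scheme.

```python
# Illustrative arithmetic only. The sizes come from the abstract; the linear
# numbering of compute nodes (0..65535) is an assumption for this sketch.
COMPUTE_NODES = 65_536
PSET_SIZE = 64                            # compute nodes per processing set
NUM_PSETS = COMPUTE_NODES // PSET_SIZE    # one I/O node controls each set
TOTAL_NODES = COMPUTE_NODES + NUM_PSETS   # compute nodes plus I/O nodes


def pset_of(compute_node: int) -> int:
    """Return the (hypothetical) processing-set index of a compute node."""
    if not 0 <= compute_node < COMPUTE_NODES:
        raise ValueError(f"compute node {compute_node} out of range")
    return compute_node // PSET_SIZE


def members_of(pset: int) -> range:
    """Return the (hypothetical) compute-node indices in a processing set."""
    if not 0 <= pset < NUM_PSETS:
        raise ValueError(f"processing set {pset} out of range")
    return range(pset * PSET_SIZE, (pset + 1) * PSET_SIZE)


if __name__ == "__main__":
    print(f"{NUM_PSETS} processing sets, {TOTAL_NODES} nodes in total")
    print(f"compute node 4660 belongs to processing set {pset_of(4660)}")
```

Under these assumptions, a management tool only needs to address the 1,024 I/O nodes directly, which is what lets the machine be treated largely as a regular Linux cluster.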