With 65,536 compute nodes, the BlueGene/L supercomputer represents a new level of scalability for parallel systems. In this paper, we discuss system management and control for BlueGene/L, including machine booting, software installation, user account management, system monitoring, and job execution. We address the issue of scalability by organizing the system hierarchically. The 65,536 compute nodes are organized into 1,024 clusters of 64 compute nodes each, called processing sets. Each processing set is under the control of a 65th node, called an I/O node. The 1,024 processing sets can then be managed to a great extent as a regular Linux cluster. Regular cluster management is complemented by BlueGene/L-specific services, performed by a service node over a separate control network.
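The arithmetic behind this hierarchy can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that compute nodes are numbered contiguously from 0 and assigned to processing sets in blocks of 64. The abstract does not describe the actual numbering or topology mapping, and the names `pset_of` and `total_managed_nodes` are hypothetical.

```python
from typing import Tuple

# Figures taken from the abstract.
COMPUTE_NODES = 65_536
NODES_PER_PSET = 64
NUM_PSETS = COMPUTE_NODES // NODES_PER_PSET  # 1,024 processing sets


def pset_of(compute_node: int) -> Tuple[int, int]:
    """Return (processing set index, rank within that set).

    Assumption: contiguous block numbering, which is an illustrative
    choice and not necessarily the real BlueGene/L mapping.
    """
    if not 0 <= compute_node < COMPUTE_NODES:
        raise ValueError(f"compute node {compute_node} is out of range")
    return divmod(compute_node, NODES_PER_PSET)


def total_managed_nodes() -> int:
    """Compute nodes plus one I/O node (the 65th node) per processing set."""
    return COMPUTE_NODES + NUM_PSETS


if __name__ == "__main__":
    print(NUM_PSETS)
    print(pset_of(64))
    print(total_managed_nodes())
```

Under this assumption, the cluster-level management layer only has to address the 1,024 I/O nodes directly, while each I/O node is responsible for its own 64 compute nodes.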