The processing elements of many modern tightly coupled multicomputers are connected by mesh or toroidal networks. Such interconnects are simple and highly scalable, but when the resources allocated to each job are dedicated, they suffer from high fragmentation, low utilization, and insufficient fault tolerance. Higher-dimensional interconnects may be more efficient in certain cases, but they are built from complex and expensive components and scale poorly. We present a novel hardware/software architectural approach that detaches the processing elements of the system from the interconnect and augments the traditional toroidal topology with additional connectivity options and additional link redundancy. We explore the properties of the resulting "multi-toroidal" topology and the improvements it offers in resource utilization and failure tolerance. Extensive simulation studies show that, for practically important types of workloads, resource utilization can be increased by 50%, and in certain cases by as much as 100%, compared to toroidal machines, and is in fact close to the theoretically optimal case of a full crossbar interconnect. This combined hardware/software architectural innovation yields a significant improvement in resource utilization beyond the state of the art in scheduling algorithm research. Multi-toroidal multicomputers are also able to operate under link failure rates of 0.002 failures per week, rates that would shut down toroidal machines. A variant of the multi-toroidal architecture is implemented in the Blue Gene/L supercomputer.
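The fragmentation problem that dedicated allocation causes on toroidal networks can be illustrated with a toy model. The sketch below is an illustrative assumption and not the paper's allocator or the multi-toroidal design. It treats each job's partition as a contiguous, axis-aligned box with wraparound on a small 2D torus, whereas the paper concerns larger, higher-dimensional machines. It then compares that model with a full crossbar, where any set of free nodes will do. The names `torus_box`, `find_contiguous`, and `fits_crossbar` are hypothetical.

```python
from itertools import product

# Toy model (assumption): a job needs a contiguous, axis-aligned w-by-h box
# of free nodes on an X-by-Y 2D torus; boxes may wrap around either dimension.

def torus_box(x0, y0, w, h, X, Y):
    """Nodes of a w-by-h box anchored at (x0, y0), with wraparound."""
    return {((x0 + dx) % X, (y0 + dy) % Y) for dx in range(w) for dy in range(h)}

def find_contiguous(free, w, h, X, Y):
    """Try every anchor and both orientations; return a fully free box or None."""
    for (x0, y0), (bw, bh) in product(product(range(X), range(Y)), {(w, h), (h, w)}):
        if bw > X or bh > Y:  # box would overlap itself after wraparound
            continue
        box = torus_box(x0, y0, bw, bh, X, Y)
        if box <= free:  # every node in the box is free
            return box
    return None

def fits_crossbar(free, w, h):
    """Full-crossbar model: any w*h free nodes can form the partition."""
    return len(free) >= w * h

X, Y = 4, 4
busy = {(i, i) for i in range(4)}  # one busy node per row and column
free = set(product(range(X), range(Y))) - busy

print(len(free))                          # 12
print(find_contiguous(free, 2, 3, X, Y))  # None
print(fits_crossbar(free, 2, 3))          # True
```

With the diagonal occupied, every 2x3 or 3x2 box on the 4x4 torus contains a busy node. A 6-node job is therefore blocked even though 12 nodes are free, while the crossbar model accepts it. This is the kind of gap between toroidal and crossbar utilization that the abstract describes.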